[00:00:08] hi nuria sorry [00:00:51] madhuvishy: that's totally ok with me, but then we might also want to establish better practices for purging data (I currently do it manually once in a while) [00:00:54] Analytics-Backlog: Pageview API: Better filtering of bot traffic on top endpoints - https://phabricator.wikimedia.org/T123442#1929942 (Nuria) NEW [00:01:16] milimetric: hmmm [00:01:49] okay, i'm not sure how that'll work - but for now i'm gonna set up prod db to point to labs db [00:02:04] and staging will have local dbs and we can also do the testing dbs [00:05:32] madhuvishy: but then won't everyone with labsdb access have access to the db? [00:06:00] milimetric: no it'll be created with our labsdb creds [00:06:08] ok, good, then +2 [00:07:12] milimetric: great then [00:16:14] (PS3) Madhuvishy: Update config files paths in wsgi file to /srv [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 [00:16:49] (CR) jenkins-bot: [V: -1] Update config files paths in wsgi file to /srv [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 (owner: Madhuvishy) [00:22:26] (CR) Madhuvishy: "recheck" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 (owner: Madhuvishy) [00:24:17] Analytics-Cluster, Analytics-Kanban, Datasets-Archiving, Datasets-Webstatscollector: Mediacounts missing top1000 files after 2016-01-01: rsync fails - https://phabricator.wikimedia.org/T122864#1930028 (ezachte) Open>Resolved Well everything did get synced in the end. My assumption on require...
[00:30:22] (PS4) Madhuvishy: Update config files paths in wsgi file to /srv [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 [00:31:17] (CR) jenkins-bot: [V: -1] Update config files paths in wsgi file to /srv [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 (owner: Madhuvishy) [00:34:16] (PS5) Madhuvishy: Change config path in wsgi file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 [00:35:16] milimetric: i have submitted the puppet patch, let me see about the other one [00:35:16] (CR) jenkins-bot: [V: -1] Change config path in wsgi file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 (owner: Madhuvishy) [00:35:32] ok jenkins doesn't like me [00:42:50] (Abandoned) Madhuvishy: Change config path in wsgi file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/261604 (owner: Madhuvishy) [00:44:00] (PS1) Madhuvishy: Change config path in wsgi file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/263782 [00:44:54] (CR) jenkins-bot: [V: -1] Change config path in wsgi file [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/263782 (owner: Madhuvishy) [00:46:27] Analytics-Tech-community-metrics, Developer-Relations, DevRel-January-2016: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1930108 (Aklapper) >>! In T103292#1892542, @Nemo_bis wrote: > Maybe make an year-long l... [00:49:46] nuria, hi [00:50:10] Krenair: hola [00:51:00] nuria, I just saw piwik and noticed it has basic restricted login, requiring the normal groups but also its own auth [00:51:10] Krenair: aham [00:51:12] does it have its own account system etc.? [00:51:41] Krenair: its own account? you mean user/pw?
[00:52:01] yes [00:54:09] nuria / milimetric: https://gerrit.wikimedia.org/r/#/c/263786/ [00:55:25] Krenair: it requires a valid ldap login [00:55:40] oh, yeah, there is an authentication layer below it too [00:56:04] (PS5) Madhuvishy: Fabric deployment setup for wikimetrics [analytics/wikimetrics-deploy] - https://gerrit.wikimedia.org/r/261579 [00:56:04] the login form rejects my ldap credentials [00:56:38] yeah, i misread your question. in addition to ldap authentication, there is another layer of auth [00:56:56] right [00:57:03] what is this extra layer of auth? piwik's internal system? [00:57:08] yeah [00:57:32] Krenair: sorry i misunderstood too, yes there is an extra account for admins [00:57:36] of piwik [00:58:02] piwik as a webapp is oriented around user accounts. so, we need that. the reason for putting *that* behind mod_authnz_ldap is so that we are not reliant on the security of piwik's authentication code [00:58:22] I see [00:58:36] this should probably be documented on piwik, along with who administrates it [00:58:39] on wikitech* [00:58:40] oops [00:59:07] yeah, good point [00:59:58] for now I have done https://wikitech.wikimedia.org/w/index.php?title=LDAP_Groups&diff=255189&oldid=199723 [01:00:51] Krenair: do not worry, once it actually works we will document it [01:00:57] Krenair: it doesn't yet [01:01:45] ok [01:01:47] cool [01:01:53] thanks nuria + ori [01:03:50] Analytics: Pageview API: Better filtering of bot traffic on top endpoints - https://phabricator.wikimedia.org/T123442#1930191 (madhuvishy) [01:07:13] a-team: the Analytics-Backlog board has been archived! Use Analytics from now on - backlog won't show up when adding the tag, etc.
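The two-layer setup ori describes above (LDAP in front, piwik's own user accounts behind it) could look roughly like this in Apache terms. This is a hypothetical sketch only: the path, realm name and LDAP URL are invented, not the production config.

```apache
<Directory "/srv/piwik">
    # First layer: mod_authnz_ldap rejects anyone without a valid LDAP login,
    # so piwik's own authentication code is never exposed to anonymous traffic.
    AuthType Basic
    AuthName "LDAP login (hypothetical realm)"
    AuthBasicProvider ldap
    AuthLDAPURL "ldap://ldap.example.org/ou=people,dc=example,dc=org?cn"
    Require valid-user
</Directory>
# Second layer: piwik's internal accounts, configured inside piwik itself.
```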
[01:07:40] \o/ [01:09:13] Analytics-Kanban, Patch-For-Review: Piwik beacon on prod instance should be accessible [5 pts] - https://phabricator.wikimedia.org/T123260#1930222 (Nuria) Turns out that patch had also been deployed by ori: https://gerrit.wikimedia.org/r/#/c/263786/ [01:11:20] madhuvishy: heh, i was sure 'MtDu' was your alter-ego [01:11:29] because 'MtDu' sounds like madhu [01:11:50] ori: ha ha [01:11:54] who is it [01:12:53] justin.d128@gmail.com, whoever that is [01:13:49] ori: ah no idea. that would have been a good irc nick for me though [01:14:36] (CR) Madhuvishy: "Not sure why Jenkins hates me for this patch!" [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/263782 (owner: Madhuvishy) [01:24:05] (PS6) Madhuvishy: Fabric deployment setup for wikimetrics [analytics/wikimetrics-deploy] - https://gerrit.wikimedia.org/r/261579 (https://phabricator.wikimedia.org/T122228) [01:26:26] Analytics-Kanban, Patch-For-Review: Piwik beacon on prod instance should be accessible [5 pts] - https://phabricator.wikimedia.org/T123260#1930287 (Nuria) a:Milimetric>Nuria [01:43:31] Analytics, Wikimedia-Mailing-lists: home page for the analytics mailing list should link to gmane - https://phabricator.wikimedia.org/T116740#1930324 (Krenair) a:kevinator [02:47:37] Analytics, Wikimedia-Mailing-lists: home page for the analytics mailing list should link to gmane - https://phabricator.wikimedia.org/T116740#1930416 (kevinator) a:kevinator>mforns @mforns, can you look into this? [08:00:41] Analytics-Tech-community-metrics, DevRel-January-2016: Make GrimoireLib display *one* consistent name for one user, plus the *current* affiliation of a user - https://phabricator.wikimedia.org/T118169#1930654 (Lcanasdiaz) This is crearly a bug in GrimoireLib, I'm working on it to fix the Organization disp... 
[09:26:59] Analytics-Tech-community-metrics, Developer-Relations, DevRel-January-2016: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1930685 (Nemo_bis) I'm not sure I follow your question but I'll try to rephrase my ques... [10:23:46] ls -lah [12:50:36] Analytics: Daily/monthly aggregation of hourly page view files halted - https://phabricator.wikimedia.org/T123477#1930889 (ezachte) NEW a:ezachte [14:18:26] morninng! [14:25:43] Helllllo [14:26:30] hey guys :] [14:26:53] Hey mforns :) [14:29:48] Analytics-Cluster, Analytics-Kanban, Datasets-Archiving, Datasets-Webstatscollector: Mediacounts missing top1000 files after 2016-01-01: rsync fails - https://phabricator.wikimedia.org/T122864#1931292 (Ottomata) An immediate update on dumps.wikimedia.org? The files there are rsynced out of HDFS... [14:42:57] hey joal :] yt? [14:43:09] Hey mforns [14:43:13] I'm here yes [14:43:48] Wassup mforns ? [14:44:07] joal, I finished the EL queue thing, and I was wondering what could I do next. I remembered we spoke about me finishing the entropy calculations on the anonymization [14:44:28] mforns: sounds a cool idea :) [14:44:31] ok [14:44:49] qq: do we have an example of anonymized table? [14:45:03] mforns: I need to collect my thoughts around that though (i'm on stats for review now) [14:45:07] give me 2 mins [14:45:34] stats for review? [14:45:51] sure! [14:46:01] cluster usage and pageview_api usage :) [14:46:13] oh! yes of course [14:46:40] joal, I will go grab some lunch and be back in 30 mins [14:46:52] mforns: I shall have some dataz for you :) [14:47:03] cool, thx, see you in a bit [14:53:17] ottomata: Hey ! Have a minute ? [14:53:24] ja in 3 minutes... :) [14:53:28] :) [14:57:03] yes oOOok [14:57:05] joal: wassuppppp [14:57:25] hm, want to discuss oozie [14:57:51] ja? 
[14:57:57] ottomata: awkward response times (both hue and cli), I wonder if there is anything we should do [14:58:11] For instance, currently unresponsive [14:58:57] And I wonder if some of the issues I fixed yesterday (jobs stuck in suspended, needed manual restart) are linked or not [14:59:00] ottomata: --^ [14:59:21] hmm [15:00:51] joal: not entirely sure, i see what you mean though. there is a long stalled project to move hive and maybe oozie to a beefier machine [15:00:58] gonna restart oozie server and see if it helps [15:01:15] ottomata: ok [15:01:38] ottomata: I'd love to prioritize the "beefy machine project" for soon ;) [15:01:46] aye [15:01:59] it mainly just requires some hive downtime [15:02:44] ottomata: I have had the thought: Let's make that while upgrading - And then thought - One change at a time :) [15:02:56] heh, nawww [15:02:57] yeah [15:03:04] no it should be a short hive downtime [15:03:13] mostly so that we can stop the existing hive server, change configs, and start up the new one [15:03:21] the mysql db is already being replicated there [15:03:24] to the new server [15:03:41] ottomata: I'm your backup eyes whenever needed for that, we can send an email about cluster downtime (cause hive downtime means almost everything downtime) [15:04:01] k [15:04:03] yeah [15:04:23] ok, restarted oozie [15:04:27] things seem a little more responsive [15:04:49] at least for now [15:04:58] ottomata: I'll double check jobs are ok [15:05:07] Thanks for the restart ottomata [15:07:21] yup [15:07:46] hey CooOol with the cluster usage joal! [15:07:52] how'd you get that info? [15:07:59] oozie? [15:08:13] hive db?
[15:09:03] ottomata: listing folders in hadoop history :) [15:09:12] haha, awesome :) [15:09:40] ottomata: I'd have liked to have better precision (as with the history server), but we only keep 7 days of history in the server :) [15:10:24] ottomata: I think the high level view is correct though (might not be when massively using spark though ;) [15:16:21] joal, back [15:16:55] Hey mforns [15:17:19] two tables you are interested in: joal.pv_to_anon and joal.pv_anon [15:17:22] mforns: --^ [15:17:27] aha [15:18:10] mforns: these tables contain more than just interesting data (namely, IPs for check) [15:18:20] mforns: you probably need to rework the thing a bit [15:18:30] aha [15:18:48] * mforns looks [15:23:35] joal, is there any difference between those tables? [15:23:52] are both generated by the same code? [15:24:23] Ah mforns : pv_to_anon is the original data, pv_anon is the anonymized one [15:24:29] I hope there are diffs mforns :) [15:24:54] joal, I see :] [15:25:27] joal, pv_to_anon is a copy of pageview_hourly, but with the unique_ips added right? [15:26:01] mforns: I think so [15:26:17] mforns: I am sure :) [15:26:37] mforns: only exception: no date fields [15:26:52] and pv_anon is the same table anonymized no? the unique_ips field is aggregated I guess? Is it the distinct union of the anonymized buckets? [15:26:56] I see [15:26:57] ok ok [15:28:26] date fields would be the exact same in all rows (because it is an hour of data no?) so no entropy at all [15:35:12] mforns: correct ! [15:35:20] mforns: sorry I was on another chan [15:35:48] np joal will try to get something :] [15:36:53] awesome, thx mforns :) [15:51:08] Analytics: Pageview API: Better filtering of bot traffic on top endpoints - https://phabricator.wikimedia.org/T123442#1931393 (JAllemandou) Did a quick check this morning: - Top end point doesn't contain what we flag as bots (namely spiders), it only contains what we flag "user". - Double checked pages "J...
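The point above about constant date fields carrying no entropy can be made concrete. A minimal sketch in Python (not the actual anonymization code; the field values are invented):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

# A field that is identical in every row of an hourly partition
# (like a date column within one hour of data) carries no information:
assert shannon_entropy(["2016-01-13"] * 8) == 0.0

# A field spread uniformly over 4 values carries log2(4) = 2 bits:
assert shannon_entropy(["a", "b", "c", "d"] * 2) == 2.0
```

Comparing per-column entropy between pv_to_anon and pv_anon would quantify how much information the anonymization removes.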
[15:52:13] mforns: hola [15:52:19] nuria, hi! [15:52:24] mforns: rather than work more on anonymization [15:52:29] aha [15:52:42] mforns: i would prefer to tackle the mobile jobs changes for reading [15:52:55] mforns: as they have been waiting for those for a while [15:53:04] ok, sure [15:53:21] mforns: this one: https://phabricator.wikimedia.org/T117615 [15:53:21] * mforns looks at the kanban [15:53:31] mforns: madhuvishy or myself can provide context [15:54:09] mforns: but we want to keep old jobs running as they are plus create a new one that runs every 7 days and splits metrics by ios and android [15:54:29] nuria, can you paste a link to the task please? [15:55:09] mforns: https://phabricator.wikimedia.org/T117615 [15:55:37] Analytics-Kanban: Gather preliminary metrics of Pageview API usage for quarterly review {slug} - https://phabricator.wikimedia.org/T120845#1931400 (JAllemandou) Draft [[ https://docs.google.com/a/wikimedia.org/spreadsheets/d/1Jm6s25e0T1npXhfM5fVtrvuC-4LM8D9BmcllUHcKlek/edit?usp=sharing | here ]] [15:56:06] nuria, cool, will look into this, thanks for the idea [15:56:48] Analytics-Kanban: Gather preliminary metrics of Pageview API usage for quarterly review {slug} [5pts] - https://phabricator.wikimedia.org/T120845#1931401 (JAllemandou) [15:56:51] mforns: thank you, need to look at CR of EL but i am afraid today is going to be piwik stuff as financial report is launching tomorrow [15:57:18] nuria, I think there's no rush for EL changes [15:57:34] at all [15:57:34] mforns: am hoping to get to those tomorrow [15:57:38] Analytics-Kanban: Gather metrics about cluster usage - https://phabricator.wikimedia.org/T121783#1931403 (JAllemandou) Draft [[https://docs.google.com/a/wikimedia.org/spreadsheets/d/1ePGLjukMcriHm92h8N25NU5DcsL1-wXhawb_PT2TazQ/edit?usp=sharing | here]] [15:57:46] ok [15:57:55] Analytics-Kanban: Gather metrics about cluster usage {hawk} [5 pts] - https://phabricator.wikimedia.org/T121783#1931404 (JAllemandou) [16:01:40]
Analytics-Kanban, Patch-For-Review: Add piwik beacon to financial report website [5] - https://phabricator.wikimedia.org/T123263#1931434 (Nuria) a:Nuria [16:18:25] joal: question for ya if you have 2 mins [16:18:35] sure nuria [16:18:37] tell me [16:19:10] joal ; in jobs such as last access where results from run are joined with existing results on a table, for example: [16:19:23] I know exactly [16:19:36] k [16:19:43] joal: how do you handle reruns? [16:19:49] nuria: This is an aspect of the code I wanted to discuss as well : is this list a good idea to join with the rest ? [16:20:08] joal: I think this is how it is done on our current mobile jobs, let me see [16:20:23] To handle rerun, you'd filter the existing data to remove already existing data for the given date [16:21:00] nuria: It is done this way on the mobile job - The reason was because data is VERY small [16:21:07] joal: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/monthly/generate_uniques_monthly.hql [16:21:13] ah sorry [16:21:27] nuria: On this example, data will be bigger (1 row per project per country) [16:21:30] joal: in this case for daily jobs data is also small 800 rows max [16:21:52] joal: maybe (if country calculation is possible) [16:22:04] Why not join then :) [16:22:29] joal: wait, join how?
[16:22:47] And to handle rerun, in the union, when gathering existing data, filter to remove the currently worked date [16:23:09] nuria: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mobile_apps/uniques/monthly/generate_uniques_monthly.hql#L86 [16:23:54] nuria: Sorry, when saying join, I meant bundle [16:24:13] joal: i see, that IS the way to handle reruns [16:24:28] cause you are forgetting about data that might exist on prior run [16:24:28] If you think it's small enough to be bundled altogether without altering readability / searchability too much, easy :) [16:24:43] correct nuria [16:24:57] joal: k i get it now, cc madhuvishy so she can see this later [16:25:00] This was actually a bug we had in the first place, not to have done it this way :) [16:25:19] And corrected it after :) [16:36:26] joal: ya, i see, i was thinking about this early on [16:47:44] (CR) Nuria: "I see, let's make sure to document how to deploy. I am not clear on who runs the fabric scripts to deploy." [analytics/wikimetrics-deploy] - https://gerrit.wikimedia.org/r/261579 (https://phabricator.wikimedia.org/T122228) (owner: Madhuvishy) [17:00:40] joal: standddup? [17:00:45] Yeeees [17:00:46] sorry [17:01:38] Analytics-Kanban: Provide weekly app session metrics separately for Android and iOS, and move to 7 day counts [13 pts] - https://phabricator.wikimedia.org/T117615#1931575 (mforns) a:mforns [17:04:05] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-v5-production: Support Piwik in production - https://phabricator.wikimedia.org/T116308#1931579 (Nuria) [17:24:44] Analytics-Tech-community-metrics, Developer-Relations, DevRel-January-2016: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#1931623 (Aklapper) > we have found no proof for the truth or falsehood of the statement...
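The rerun handling discussed above — generate_uniques_monthly.hql unions freshly computed rows with the existing table minus the period being recomputed — can be sketched in miniature. Plain Python standing in for the HiveQL, with invented row shapes:

```python
def merge_period(existing_rows, new_rows, period):
    """Drop any previously written rows for `period`, then append the
    fresh rows, so a rerun overwrites instead of duplicating."""
    kept = [r for r in existing_rows if r["month"] != period]
    return kept + new_rows

table = [{"month": "2015-12", "uniques": 100}]
# First run for 2016-01:
table = merge_period(table, [{"month": "2016-01", "uniques": 120}], "2016-01")
# Rerun for 2016-01 (e.g. after a backfill) replaces rather than duplicates:
table = merge_period(table, [{"month": "2016-01", "uniques": 125}], "2016-01")
assert table == [{"month": "2015-12", "uniques": 100},
                 {"month": "2016-01", "uniques": 125}]
```

Without the filter step, the second run would leave both the 120 and 125 rows in place — the bug joal mentions having fixed.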
[17:56:58] Analytics, Ops-Access-Requests, operations: add mforns, milimetric, nuria,ottomata, madhuvishy and joal to piwik-roots - https://phabricator.wikimedia.org/T122325#1931672 (RobH) Meeting result: We approved the analytics group folks in this request to add mforns, milimetric, nuria,ottomata, madhuvishy... [18:03:43] o/ [18:04:03] joal: would it be useful to add a note in the spreadsheet about where the data comes from? [18:05:15] (Cluster Usage) [18:07:06] * elukey has at least 10 things to clarify from the Ops meeting, plus all the Analytics ones [18:08:12] at some point the function's slope will stop growing exponentially, not sure when :D [18:38:20] Analytics-Tech-community-metrics, DevRel-January-2016: Make GrimoireLib display *one* consistent name for one user, plus the *current* affiliation of a user - https://phabricator.wikimedia.org/T118169#1931836 (Lcanasdiaz) Patch being reviewed https://github.com/VizGrimoire/GrimoireLib/pull/67 [18:40:30] Analytics-Kanban: Prepare presentation quarterly review - https://phabricator.wikimedia.org/T123528#1931840 (Nuria) NEW a:madhuvishy [18:54:59] okay, the eventlogging tables on analytics-store are still massively backlogged [18:55:18] I'd like to know when I can expect them to not be; our daily reporting is broken and I am unable to validate an A/B test we have just launched. [18:55:58] Ironholds: a bit of bad timing i think. Jaime (or someone else?) has set up some weird non standard custom mysql replication for eventlogging [18:56:24] aside from a commit (that says it was brought into puppet whereas it was just in some cron before) there isn't really any documentation [18:56:30] and [18:56:33] jaime is on vacation this week [18:57:05] yeah, I get both of those bits. But does that mean (a) replication cannot be fixed for the time being or (b) we don't know how long until replication will be fixed? [18:59:03] i think b, i'm not even sure who to ask or what to do about it. we can ask in #ops if you like.
i poked at it yesterday, and I could see the slave was pretty busy, but not so different than it has been in the past [19:00:03] ottomata, yeah, that would be really appreciated. This is a massive blocker on..pretty much all the work my team has. [19:00:11] alternately we can bug YuviPanda [19:00:20] there are some really long running queries I see [19:00:36] CREATE TEMPORARY TABLE staging.view_count ... [19:00:37] is that you [19:00:38] ? [19:00:56] it is not. lzia, halfak|Lunch , you know anything about that? [19:01:02] (when I think staging I think R&D ;p) [19:01:21] i mean, i doubt that is the problem though. it's the longest query [19:01:23] been running for over a day [19:01:27] *nod* [19:01:30] but, this has been a problem for longer than that [19:01:39] but mostly it's "we have a lot of tables and nobody really understands how they get populated" [19:01:43] it's hard for me to even see how far the lag is, since this is custom replication [19:01:51] ottomata, that's me. [19:01:52] i mean, i can read the script [19:01:58] and understand it [19:02:02] Sorry for the trouble. [19:02:03] :/. In the longer-term we should fix that. It's been a problem far more than once. [19:02:07] halfak: i doubt it is your fault [19:02:09] halfak, naw, it's not the source of the issue! [19:02:10] If it's blocking replication, I can kill it. [19:02:13] was just wondering [19:02:14] OK [19:02:22] it probably isn't helping, but i think there is something bigger wrong here [19:02:37] I've had this query complete before FWIW. [19:04:11] yeah, let it go [19:04:22] disk util on this box is 100% [19:04:23] ottomata, backup plan; how do I get access to the raw events stream/the non-analytics-store db it writes to? Then I can at least validate the A/B [19:04:28] uh-oh [19:04:43] Ironholds: the events are all in hdfs [19:04:45] if you like [19:04:53] oh, they are?!
SWEET [19:04:55] or you can get them in files on stat1002/3 occasionally synced [19:04:58] or you can just subscribe to kafka [19:04:59] ja [19:05:02] Ironholds: sorry but hdfs is the best we can do [19:05:07] * Ironholds runs joyously towards hive [19:05:09] Ironholds: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Access_data_in_Hadoop [19:05:18] it's a little easier in spark, but ja hive will work too [19:05:20] nuria, you're apologising for them being in a format I know how to use. No apology necessary :D [19:05:31] Ironholds: we all suffer from having only one DBA really, poor jaime is overworked just like sean used to be [19:06:11] ottomata: can we drop a chunk of the data on master on the table we just blacklisted? [19:06:21] yeah :( [19:06:27] ottomata: and thus reduce data that it needs to be replicated? [19:06:35] oh, maybe? [19:06:39] what table? [19:06:48] ottomata: let's try it, let me see as of name [19:07:04] what's the master db? [19:07:12] oh wait i know [19:07:26] ottomata: ya , the EL way [19:07:36] ottomata: MobileWebSectionUsage [19:08:03] ottomata: that table is huge i think and I believe unqueriable we can drop data and note it as such cc jdlrobson [19:08:42] nuria: we blacklisted that in el? [19:08:52] from valid-mixed topic? [19:08:53] ottomata: yes, after a storm of events [19:09:09] ottomata: right, we did so yesterday: ./hieradata/common.yaml:eventlogging_valid_mixed_schema_blacklist: ^Analytics|CentralNoticeBannerHistory|MobileWebSectionUsage$ [19:09:10] cool [19:09:11] ja [19:09:12] ok [19:09:17] ottomata: cause it was loading 100 events per sec [19:09:22] ottomata: thus must be huge [19:09:27] how long ago though? [19:09:30] did we do that? [19:09:45] ottomata: blacklisting ?
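A side note on the blacklist value quoted above: in most regex engines, alternation binds loosest, so the ^ and $ anchors attach to individual branches rather than to the whole pattern. Illustrated here with Python's re module (which is not necessarily the matcher EventLogging itself uses, and the sample strings are invented):

```python
import re

blacklist = r"^Analytics|CentralNoticeBannerHistory|MobileWebSectionUsage$"

# Reads as: (^Analytics) OR (CentralNoticeBannerHistory) OR (MobileWebSectionUsage$)
assert re.search(blacklist, "AnalyticsAnything")             # prefix-anchored branch
assert re.search(blacklist, "XCentralNoticeBannerHistoryX")  # unanchored middle branch
assert re.search(blacklist, "OtherMobileWebSectionUsage")    # suffix-anchored branch

# Grouping the alternation makes both anchors apply to every branch:
grouped = r"^(Analytics|CentralNoticeBannerHistory|MobileWebSectionUsage)$"
assert not re.search(grouped, "XCentralNoticeBannerHistoryX")
```

Whether the per-branch anchoring here is intentional (e.g. matching every schema that starts with "Analytics") or an accident is worth a second look when editing the list.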
yesterday, i'd say 24hrs ago [19:09:51] oh ok [19:10:03] ottomata: Mon Jan 11 14:46:18 2016 -0800 [19:10:22] ottomata: sorry, must have been day before yesterday [19:10:53] k [19:11:01] ottomata: a shot in the dark [19:11:12] ottomata: but the size must be an issue [19:12:15] oh yeah it is 2 days behind [19:12:18] MobileWebSectionUsage_15038458 is [19:12:27] actually they both are [19:12:33] both MobileWebSectionUsage tables [19:12:37] so, nuria should I just drop the tables? [19:12:41] from the master? [19:13:10] ottomata: i'd say drop those tables since data is in hadoop right? [19:13:17] sure, ok. [19:13:25] hoping dropping doesn't break things... [19:13:38] ottomata: in EL? [19:13:42] ottomata: or the db [19:13:49] ottomata: i can speak for EL [19:14:02] ottomata: as i have done it before and it causes no issue [19:14:04] the db [19:14:27] ottomata: it shouldn't right? [19:14:31] I would 1) stop EL [19:14:36] no, EL will be fine [19:14:45] just not sure about how intensive dropping the table is for mysql master [19:14:45] 2) rather stop EL mysql consumer [19:14:47] looks ok [19:14:54] well, the files on disc have embedded nuls [19:15:01] oh good. [19:15:06] * Ironholds loads up hive [19:15:20] ottomata: as of mysql i cannot speak though [19:15:21] nuria: want me to just wait til you are back around? [19:15:27] in 20 mins? [19:15:37] springle: maybe you are there...? [19:16:03] ottomata: for mysql i am going to be of little help [19:16:25] ottomata: i'd say ask in ops and if nobody thinks it's crazy we should do it [19:16:43] ottomata: in the absence of a better idea? [19:17:23] k, i will ask, but prob still wait for you, i just want someone around to make me feel better :) [19:18:24] ottomata: ok, let's do it [19:19:27] nuria: i wait, yes? [19:19:30] when you get back?
[19:19:40] ottomata: nah, i will be here, not moving for now [19:19:40] am asking in mw sec [19:19:44] oh ok [19:21:13] hmm, nuria [19:21:17] there are new events for these tables on the master [19:21:17] yes [19:21:23] ja events coming in [19:21:27] as of today [19:21:28] ? [19:21:39] select max(timestamp) from MobileWebSectionUsage_15038458; [19:21:41] one min [19:21:42] was [19:21:44] 20160113192030 [19:21:47] now is [19:21:48] 20160113192051 [19:23:12] ottomata: are the hieradata changes deployed then? [19:23:28] ottomata: EL needs to be restarted right? [19:23:34] ottomata: i bet that did not happen [19:23:39] ah, yeah it does [19:23:44] ottomata: well.. [19:23:46] yeah it's in the config file [19:24:10] ok, i restart eventlogging [19:24:11] madhuvishy: did you guys restart EL when banning the mobile schema? [19:24:24] oh [19:24:34] ottomata: so restarting should stop gathering of those events [19:24:47] ottomata: and once that is happening then dropping tables will help [19:24:48] !log restarting eventlogging to apply blacklist of MobileWebSectionUsage schemas [19:24:48] i didn't - thought puppet run would do it [19:24:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [19:24:57] madhuvishy: naw, puppet can't be trusted to do that :) [19:25:05] right okay [19:25:11] ottomata, the EL example does not work for me; I don't have WRITE permissions for the 'ironholds' db apparently [19:25:12] either: 1 it isn't supposed to (i wouldn't have it do it) [19:25:16] or 2: it doesn't work [19:25:23] i've always had to restart el for config changes [19:25:28] !? [19:25:30] ottomata: ya, same here [19:25:32] Ironholds: with you shortly [19:25:38] kk [19:25:47] ottomata: re-starting EL for config changes that is [19:25:49] ok, looking better, no more MobileWebSectionUsage in all-events.log [19:26:04] ottomata: ok, then we should be ok to drop tables [19:27:29] ottomata: let me know [19:28:48] am dropping...
[19:28:50] from m4-master [19:28:54] ottomata: k [19:31:11] drop of MobileWebSectionUsage_14321266 succeeded fine [19:31:16] drop of MobileWebSectionUsage_15038458 taking a while... [19:31:23] ottomata: makes sense [19:33:34] uh oh [19:33:49] seeing some nasty looking disk errors i think [19:33:51] on m4-master [19:34:08] ottomata: argh [19:34:15] things seem to be working.... [19:34:20] el is inserting, or so it says it is [19:34:28] (PS1) Wassan.anmol: Updated result of validation after creating cohort. [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/263911 [19:34:32] ottomata: errors as of size? [19:34:46] actually might not be a disk error, just saw some DIMM things [19:34:49] Jan 13 19:34:30 dbproxy1004 kernel: [29033745.569501] EDAC MC0: 146 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0) [19:34:57] ottomata: k [19:35:08] googling [19:35:09] ottomata: so you restarted everything [19:35:10] Memory Correctable Errors (CE) [19:35:15] el, yes [19:36:24] nuria: fyi, these DIMM errors have been happening for days [19:37:09] ottomata: ok, let's keep an eye and check in an hour for replication lag right? [19:37:17] ottomata: sounds good? [19:38:59] the drop table is still running... [19:39:02] i'm going to let it go [19:39:08] i think there is a bad mem chip on this master box [19:39:18] or slot perhaps [19:39:24] making a ticket and CCing chris J [19:41:46] ottomata: ok [19:45:07] Analytics, operations, ops-eqiad: Possible bad mem chip or slot on dbproxy1004 - https://phabricator.wikimedia.org/T123546#1932143 (Nuria) [19:48:34] ottomata: will be checking back in a bit [19:49:55] Ironholds: ok [19:49:56] whasssup?
[19:53:31] Ironholds: afaict you should be able to write in ironholds db [19:54:59] Ironholds: that worked for me just fine [19:55:01] that example [19:55:03] as your user too :) [19:57:12] ottomata, I'll upload my current code then and we can see what the difference is [19:58:11] ottomata, https://github.com/wikimedia-research/PortalSearchBoxTest/blob/master/data_validity.R#L10-L24 these lines error [20:01:12] !log dropped MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 from analytics-store eventlogging slave db [20:01:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [20:20:25] Analytics-Tech-community-metrics, DevRel-January-2016, Easy, Google-Code-In-2015, Patch-For-Review: Clarify Demographics definitions on korma (Attracted vs. time served; retained) - https://phabricator.wikimedia.org/T97117#1932354 (Aklapper) Open>Resolved Patch is deployed downstream on htt... [20:25:16] ottomata, so any idea why we're getting different results? [20:28:15] Ironholds: eh? [20:28:30] in hive for the EL example? no i'm still not sure what your problem is [20:28:37] i sudo-ed to ironholds user [20:28:41] and was able to run that example just fine [20:28:47] i had to change the partition date [20:28:48] your example or mine? [20:28:54] mine [20:28:57] do you have an example? [20:29:21] oh, i dropped off IRC for 2 minutes [20:29:25] maybe you pasted it and i missed it [20:29:33] right around 14:58 [20:29:43] (oh that's my time...is it yours too?) [20:30:54] ottomata, https://github.com/wikimedia-research/PortalSearchBoxTest/blob/master/data_validity.R#L10-L24 these lines error [20:30:56] ^ [20:32:25] k, looking [20:33:33] Ironholds: [20:33:37] '/wmf/data/raw/eventlogging/eventlogging_eventlogging_WikipediaPortal'; [20:33:40] is not a dir [20:33:41] you want [20:33:45] '/wmf/data/raw/eventlogging/eventlogging_WikipediaPortal'; [20:36:38] ottomata, oh, guh! [20:36:39] thanks!
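For the record, the raw EventLogging data under /wmf/data/raw/eventlogging is JSON, one event per line, so it can also be sanity-checked without Hive. A sketch in Python — the capsule fields shown (schema, revision, timestamp, event) and all values are illustrative, so check them against the wikitech page linked above:

```python
import json

# One raw EventLogging record (invented values, illustrative field layout):
line = ('{"schema": "WikipediaPortal", "revision": 1, '
        '"timestamp": 1452715200, "event": {"session_id": "abc123"}}')

record = json.loads(line)
assert record["schema"] == "WikipediaPortal"
# Schema-specific fields live in the nested "event" object:
session_id = record["event"]["session_id"]
```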
[20:37:21] sho thang :) [21:25:09] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#1932757 (Sadads) @Krenair you were already on it, but I think I did update it correctly for a... [21:29:50] Analytics-Tech-community-metrics, DevRel-January-2016, Easy, Google-Code-In-2015, Patch-For-Review: Clarify Demographics definitions on korma (Attracted vs. time served; retained) - https://phabricator.wikimedia.org/T97117#1932785 (Nemo_bis) I still have no idea what those things mean... > Attr... [21:56:26] nuria, ottomata: please do not drop the mobilewebsectionusage tables [21:56:38] in case you didn't see it, we had a thread about these on analytics-l last week [21:57:12] ...which also resulted in jdlrobson turning them off on monday, i think [21:57:51] oh wellll! too late :D [21:58:01] (the data is not gone, however, it is just not in MySQL) [21:58:30] after that thread i sat down with marcel last week and looked into accessing the data in hadoop, but it's not straightforward and hasn't been tested yet [21:58:41] yeah, it's not as easy as mysql, but not too bad [21:59:04] the tables have been dropped :/ [21:59:10] nuria: ^ [21:59:25] ottomata: nice, let's see about replication in 1 hr [21:59:34] ottomata: i bet things are going to catch up [22:00:34] well, they've been dropped for a few hours now [22:00:41] not sure how to check [22:00:49] oh i guess we can check the table Ironholds was looking for [22:00:49] ottomata: in meeting will check back later [22:01:03] is there any way to get them back for doing analysis with mysql? [22:01:21] HaeB: That volume of data ? no, [22:01:55] nope, still 3 days behind [22:01:57] HaeB: sorry, [22:02:31] HaeB: the volume was much too high for the tables we think [22:02:42] folks, what's going on here?
you don't read analytics-l, and then don't even bother with the analyst who's working on these tables before deleting them irreversibly? [22:02:55] we talked about the volume last week already [22:03:18] and as i said, jon turned them off already (AIUI) [22:03:41] from earlier it looked like they were still being written to, iirc [22:03:42] don't even bother pinging [22:08:18] ottomata: should I reschedule the DevOps check point tomorrow? [22:08:28] ottomata: it conflicts with metrics [22:12:06] Analytics-Tech-community-metrics, DevRel-January-2016: What is contributors.html for, in contrast to who_contributes_code.html and sc[m,r]-contributors.html and top-contributors.html? - https://phabricator.wikimedia.org/T118522#1933110 (Aklapper) I still support my proposal in T118522#1845064 but I wonder... [22:15:22] lzia: sure [22:15:30] "[11:08] ottomata: that table is huge i think and I believe unqueriable" - that was nonsense, we did various successful analyses with it, see e.g. https://phabricator.wikimedia.org/T118041 or https://meta.wikimedia.org/wiki/Research:Which_parts_of_an_article_do_readers_read#Section_expansions [22:17:48] HaeB: many apologies. I don't pay attention to all threads in mailing lists, and I don't usually read discussions about schemas. [22:18:14] and again, Hadoop is not an equivalent option for this, e.g. the existing queries we already wrote for mysql won't work there, and marcel and i did not see a way to set up a whole table right away in hive (as opposed to a single partition, i.e. one hour) [22:18:17] the MySQL slave is having trouble right now [22:18:29] Jaime is on vacation, and he is the only one that knows anything about it [22:18:52] I was told that these tables had been blacklisted from mysql yesterday, and was told we should just remove them (they were large and behind too), in an effort to help other tables replicate faster [22:19:18] ottomata: ok, but what changed?
the newer (higher volume) version of the schema has been live for about four weeks already [22:20:06] HaeB: I don't know. it's possible that your schema has nothing to do with the problem. But, we do avoid using mysql with higher volume data [22:20:13] ..and after last week's thread i understood from marcel that the issue was disk space, if at all - not the event rate per se [22:20:20] i doubt that your schema caused the lag, but high volume does not help [22:21:29] am reading that thread now... [22:22:17] HaeB: you should be able to query the whole data at once [22:22:21] that is def possible [22:22:25] in hive [22:23:05] just drop the partitioned by clause in the create table statement [22:23:44] Oh, hm. or maybe that doesn't work if the data is deep in subdirs, not sure... [22:24:05] but, you can always just add all the partitions and then query by the largest partition [22:24:13] like, where year in (2015, 2016) or something [22:25:25] as someone who has just had to do the manual construction it is a colossal PITA [22:25:45] I am more concerned by the communications element here. Even without reading the threads, someone owns every EL table and should be reached out to. [22:27:08] Ironholds: I agree, this was not done well.
will talk with nuria and others in standup tomorrow [22:29:11] HaeB: Ironholds, spark might be easier for you, you can use wildcards [22:33:33] HaeB: no, event rate was a big issue [22:33:44] HaeB: we had 100 events per sec [22:34:09] HaeB: dropping the table was kind of a desperate measure to be able to serve other users of the system [22:35:39] ja [22:35:42] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=5&fullscreen [22:36:08] HaeB: I am sorry about that but for us solving database issues w/o a dba around requires desperate measures [22:36:40] HaeB: and to be clear event rate was an issue, replication was held up on that table per dan's research on Monday [22:37:06] http://imgur.com/XjqHJeU [22:38:07] HaeB: so, while i understand that you prefer mysql we cannot guarantee that mysql would work regardless of the rate events are sent in [22:38:12] HaeB: it is just not possible [22:39:15] nuria: with all due respect, it seems you still haven't read the analytics-l thread. on january 3, marcel already said "it's sending around 120 events per second now", and the followup conversation made clear that that rate was maybe a concern, but by no means created a desperate situation in itself [22:39:31] plus, blacklisting the schema (no longer recording events) would have been fine [22:39:43] completely deleting without even pinging me -not [22:40:04] HaeB: sorry, we were talking about monday's (11 jan) conversation about rate [22:40:18] HaeB: rate was 100 events per sec [22:40:44] Analytics-Tech-community-metrics, DevRel-January-2016: Improve Key performance indicator: code contributors new / gone - https://phabricator.wikimedia.org/T63563#1933223 (Aklapper) ....and on http://korma.wmflabs.org/browser/code_contrib_new_gone.html I have to manually do the math to subtract "Abandoned"...
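To put the rates quoted in this exchange in perspective, a back-of-envelope calculation shows why a schema sending 100-120 events per second fills a MySQL table quickly (the four-week window matches HaeB's note that the higher-volume schema had been live about four weeks):

```python
# Back-of-envelope check of the event rates discussed above: at
# 100-120 events/sec, a single schema's MySQL table grows by
# millions of rows per day.

SECONDS_PER_DAY = 86_400
WEEKS_LIVE = 4  # the higher-volume schema had been live ~four weeks

for rate in (100, 120):
    per_day = rate * SECONDS_PER_DAY
    total = per_day * WEEKS_LIVE * 7
    print(f"{rate} events/sec = {per_day:,} rows/day = {total:,} rows in {WEEKS_LIVE} weeks")
```

At 100 events/sec that is roughly 8.6 million rows a day, or nearly a quarter of a billion rows over the four weeks, for a single table on a replicated host.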
[22:40:45] so even a bit lower than at the time of marcel's email [22:40:51] HaeB: but you are right, dropping table is not the best measure, note that we had blacklisted it already [22:40:59] hi mforns. question: is the demo you showed for pageview UI somewhere public? [22:41:11] lzia, yes, looking [22:41:33] HaeB: at the same time the tools we have to deal with these issues are very few [22:41:35] lzia, https://analytics.wmflabs.org/demo/pageview-api/ [22:41:43] HaeB: and data hasn't been lost [22:41:46] perfect. thanks mforns. [22:42:08] HaeB, you're right about what I said, but the situation got a lot worse since then [22:42:44] what is the time frame for which it plots, mforns? [22:42:45] Ironholds: will send e-mail. we just got our heads above water on this. [22:43:32] lzia, the demo accepts dates from Oct 1st [22:44:25] lzia, but the date selector, now that I look at it, has broken icons (months arrows) [22:44:35] sorry, mforns. ignore my confusion: it doesn't show the plots in Firefox. Now that I am in Chrome, it works perfectly fine. thanks. [22:44:40] it works, but it's difficult to see [22:45:15] lzia, mmmmm, we must have added some new code that breaks in firefox, it should work, will create a task for that [22:45:33] perfect. thanks much, mforns. :-) [22:46:18] Analytics: Pageview API demo is broken in Firefox and does not show date selector arrows in Chrome - https://phabricator.wikimedia.org/T123584#1933243 (mforns) NEW [22:51:06] Ironholds, HaeB: have sent e-mail regarding dropping of table, sorry about it again. [22:52:06] fair. And do we only have one DBA in the org? nobody else can solve the underlying problem? :/ [22:52:18] Ironholds: at this time we have zero dbs [22:52:23] dbas [22:52:35] blargh [22:52:37] Ironholds: cause jaime also needs vacation you know [22:52:46] yeah, totally [22:53:06] dear mark b. Bus factors! 
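Returning to ottomata's earlier suggestion ([22:24:05]) of adding all the partitions and then filtering by year: one way to avoid the "colossal PITA" of manual construction is to generate the Hive `ALTER TABLE ... ADD PARTITION` statements in a loop. A hedged sketch, not the team's actual tooling, simplified to year/month partitions (raw EventLogging data is actually partitioned down to the hour) and with an illustrative table name:

```python
# Sketch of generating Hive ADD PARTITION statements so a whole table's
# history can be queried with e.g. WHERE year IN (2015, 2016).
# Simplified to year/month partitions; table name is illustrative.

from itertools import product

def add_partition_stmts(table, years, months):
    """Yield one ALTER TABLE statement per (year, month) partition."""
    for year, month in product(years, months):
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year={year}, month={month})"
        )

stmts = list(add_partition_stmts("WikipediaPortal", [2015, 2016], range(1, 13)))
print(len(stmts), "statements, e.g.:", stmts[0])
```

The generated statements can then be fed to the `hive` CLI in one batch, after which queries constrained on the partition columns only scan the matching directories.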
[22:53:37] so while i am sorry this is a pain (cc HaeB) and i take responsibility for it, it was, i believe, the fastest way to get EL working for the rest of the users without losing data [22:54:11] HaeB, Ironholds, I just edited the pyspark example to use wildcards and event data [22:54:12] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Spark [22:54:25] pyspark + spark SQL [22:54:48] Ironholds: we had 2, but one left, they are trying to get another [22:54:57] *nod* [22:55:28] Ironholds: loading data like "/wmf/data/raw/eventlogging/eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*" [22:55:28] in spark works [22:56:33] This is incredibly cool but if it takes long enough to fix the replag that I have to learn an entirely new process in a programming language I do not use and then switch over all our data collection scripts to that method and language we have a much more serious problem. [22:56:53] so I will probably use it for hyper-urgent ad-hoc stuff and just notify people that some-but-not-all dashboards will be broken for an unknown amount of time [22:58:05] Ironholds: You are so right, ay ay ... [22:58:30] that probably sounds overly critical, in which case I apologise :D. It is super-cool and I appreciate ottomata's updates: they do make that ad-hoc work easier! [23:07:24] ottomata: yes, as Ironholds says, that looks very cool, but it's not very helpful for me either as i would be using hive, not pyspark [23:07:57] HaeB: ok, why not spark though? (you can use python or scala, either/or) you still write sql, you just have to tell it how to load the data [23:16:52] ottomata, in my case because you're asking someone to learn an entirely new technological stack and rewrite all of their stuff.
That's a helluva lot of effort to have dumped on you at once and it plays hell with timetabling [23:17:14] like, I am happy to learn Spark if you are happy to talk to my boss and explain to him that all the things I was going to get done this week aren't and it's because I will be studying ;p [23:17:19] ottomata: well, learning spark (and setting up workflows around it) sounds like a fun project i might embark on at some point, but i don't know how soon i might be able to set aside time for it [23:17:36] Ironholds: ha ;) [23:17:47] synchronicity [23:17:51] * Ironholds high-fives [23:18:49] aye cool [23:23:54] Ironholds: HaeB, because I am not very familiar with this...one of the reasons you need EL stuff in analytics-store is because it links to MW dbs, right? [23:25:14] no, he needs EL stuff in the analytics-store because you deleted his data from MySQL ;p [23:25:19] no, i did not need that so far in this case (it might have some benefits though because the schema includes page IDs without the corresponding titles, which one might want to look up in the MW dbs) [23:25:25] I need it because replag means the data I needed to validate does not need it in MySQL [23:25:31] ..does not appear in [23:28:58] HaeB: again sorry, but given grafana and the flow of events to the table since 12/18 and given that that schema was about 90% of EL inflow since 2016/01/01 we cannot guarantee that mysql would be working on those conditions of data volume [23:29:44] well, i'm just trying to figure out how we could help in the short term [23:29:53] since we really have no idea what is going on with the replication here [23:29:57] and jaime is on vacation [23:33:14] nuria mforns: i understand your reasoning for the blacklisting (and am ok about it, if not super happy - there should have been a notification for that already, and a followup in the existing thread) [23:35:12] Analytics: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 -
https://phabricator.wikimedia.org/T123595#1933412 (Tbayer) NEW [23:35:17] ok, i'll be in a meeting now - i've filed a task after discussion with dr0ptp4kt, perhaps we can continue talking there and on the thread nuria started [23:40:01] Analytics: Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458 - https://phabricator.wikimedia.org/T123595#1933431 (Nuria) Tables will start existing once blacklisting is lifted, let us know when new sampling ratio has taken effect. >The suggestion was to access the data in Hadoop in...
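For reference, the wildcard path ottomata showed at [22:55:28] covers every hourly partition of 2016 because the raw data layout is `<schema>/hourly/<year>/<month>/<day>/<hour>`. A minimal illustration of how the glob matches, using Python's fnmatch as a stand-in for the Hadoop glob matching Spark actually performs:

```python
# Illustration of the wildcard pattern quoted above, matched with
# fnmatch rather than a real Spark/HDFS call.

from fnmatch import fnmatch

# Every hour of 2016 for one schema:
pattern = ("/wmf/data/raw/eventlogging/"
           "eventlogging_MobileWikiAppFindInPage/hourly/2016/*/*/*")

# A concrete hourly partition directory (month/day/hour):
hour_dir = ("/wmf/data/raw/eventlogging/"
            "eventlogging_MobileWikiAppFindInPage/hourly/2016/01/01/12")

assert fnmatch(hour_dir, pattern)                                  # 2016 matches
assert not fnmatch(hour_dir.replace("/2016/", "/2015/"), pattern)  # other years do not
```

In pyspark the same pattern can be passed where a path is expected (as in the wikitech example), since Hadoop's input formats expand globs when listing files.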