[01:05:52] madhuvishy: I hated the gerrit model of git so much that I wrote a github <-> gerrit bridge, and used that exclusively :P [01:06:13] madhuvishy: and then I spent more time understanding it and now think the gerrit model is nicer, and stopped maintaining the bridge [01:06:20] however, the UI is still a horrible piece of shit [01:06:30] madhuvishy: and you aren’t stupid, no [10:26:55] mforns: Hi ! [10:27:02] hey joal! [10:27:24] Let me know if you want to have a chat on how we start on the python stuff :) [10:28:01] joal, what is your schedule today? [10:28:40] joal, I'll be working for 1 more hour now, and then from 16h to 21h CET. [10:28:47] Starting now, probably a break in the afternoon [10:28:50] ok [10:29:06] So, let's put together a plan now, and then see :0 [10:29:08] :) [10:29:11] ok [10:30:00] batcave ? [10:30:07] in 5 mins ? [10:30:10] do you mind talking on IRC, I'm in a library... [10:30:16] np :) [10:30:20] but in the afternoon I'll be at home [10:30:43] ok [10:31:16] So, how do we approach the thing ? [10:31:29] so, we should modify the current eventlogging to use kafka as [10:31:44] substitute zmq by kafka [10:31:49] right [10:31:56] and what else? [10:31:58] do we try to get a place for testing ? [10:32:51] sure, you mean a kafka for ourselves? [10:33:38] hm, kafka in labs should be enough (we can ask andrew first, but should be ok) [10:33:49] ok [10:34:00] But we still need a testing place for the python in labs [10:34:12] we can use beta-labs eventlogging instance, no? [10:34:23] Current EL in labs is used by our customer, so I wouldn't go and break it [10:34:52] But maybe just letting the customer know is enough ? [10:34:54] ok, it will take some time so you're right [10:34:54] Don't know ...  [10:35:52] ok, so should we launch a labs instance?
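The zmq-to-kafka substitution mforns and joal discuss above could look roughly like this in Python. Everything here is a hedged illustration, not the real EventLogging code: the topic name and the JSON-line wire format are assumptions, and StubProducer is a test stand-in for something like kafka-python's KafkaProducer, which exposes the same send(topic, value) shape.

```python
import json

def serialize_event(event):
    """Encode an EventLogging event dict as a UTF-8 JSON line (assumed wire format)."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

class KafkaEventWriter:
    """Hypothetical Kafka-backed writer: anything with send(topic, value) works."""
    def __init__(self, producer, topic="eventlogging-valid-mixed"):  # topic name is an assumption
        self.producer = producer
        self.topic = topic

    def write(self, event):
        self.producer.send(self.topic, serialize_event(event))

class StubProducer:
    """In-memory stand-in so this sketch runs without a broker."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

producer = StubProducer()
writer = KafkaEventWriter(producer)
writer.write({"schema": "Search", "rev": 1})
```

Swapping StubProducer for a real kafka-python KafkaProducer pointed at a labs broker would be the next step the chat describes.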
[10:36:07] I think it could be a good idea [10:36:25] The thing is I don't know how to deploy EL globally :( [10:36:38] I don't know well enough the full flow [10:36:50] me neither [10:36:58] But starting a labs instance is a good first step [10:37:09] So let's do that, and then try to deploy :) [10:37:24] but we can try to checkout puppet repo and try to execute it [10:37:26] ok [10:38:16] are you creating it? I'm on the Instance list [10:39:23] medium? [10:39:37] please go ahead, I'm finishing emails :( [10:39:43] Thanks ! [10:41:30] (CR) Joal: [V: 2] Use wmfuuid from x_analytics_map for app install id if available else default to the query parameter [analytics/refinery] - https://gerrit.wikimedia.org/r/207689 (https://phabricator.wikimedia.org/T96926) (owner: Madhuvishy) [10:43:38] joal, it doesn't work for me :'( [10:43:51] hm, you mean, deploying el / [10:44:09] ? [10:44:15] joal, I could only create a small instance, medium and large do not work for me [10:44:41] k ... Knowing that we'll only have a very small amount of data, maybe small is enough ? [10:45:37] ok [10:55:27] joal, mmm seems that eventlogging is not puppetized? [10:55:41] hm [10:56:10] ok I'll try to install raw eventlogging [10:56:29] I am looking at deployment-eventlogging02.eqiad.wmflabs [10:56:34] See if there is any info [10:56:48] Seems that there is an eventlogging role [10:59:13] hm [10:59:19] Can't find a puppet code base either [10:59:38] I guess we can wait for andrew about that [10:59:42] mforns: --^ [11:00:10] joal, I think the eventlogging role is for vagrant puppet? [11:00:20] might [11:07:07] joal, I installed eventlogging, but it does not work... I'm trying to reboot to see if upstart launches it [11:07:29] mforns: do you have a dedicated DB ? [11:07:43] joal, oh forgot that.. [11:12:58] It looks like http://language-reportcard.wmflabs.org/ stopped working after April 30? [11:13:38] can anyone take a look? [11:14:30] Hi kart_ [11:15:02] joal: hey [11:15:09] joal: can you look at that?
[11:15:12] I don't know exactly how data for language-reportcard is generated, but will find out later today [11:15:27] I think milimetric knows best [11:15:35] joal: yes. he knows :) [11:16:09] Sorry for not helping before he gets there :S [11:16:18] joal: no issue! [11:18:42] mforns: so, maybe it was the wrong path to try and deploy EL on our own :) [11:19:39] joal, I applied the role eventlogging to the instance [11:19:47] and also installed mysql [11:19:53] ok :) [11:19:55] Cool ! [11:21:10] joal, but I'm not sure what the role is supposed to do [11:21:53] joal, since I rebooted with the new role, there's this process executing: deploy.fetch eventlogging/EventLogging [11:22:24] Sounds good for install [11:22:37] Now the thing is, how to ensure it gets deployed on its own :) [11:22:51] mmm [11:28:57] joal, well, I have to go now.. I can continue at 16h CET [11:29:14] np, I am trying to get through some python [11:29:18] Thanks mforns :) [11:29:22] thanks! [12:22:57] ottomata: Morning :) [12:31:14] mooorning! [12:31:36] whassup ? [12:35:11] Have a minute for me ottomata ? [12:35:22] suuuure [12:35:30] thx, batcave ? [12:35:32] yup [12:55:52] joal: you gone? [12:55:56] nope [12:55:58] ? [12:56:02] Seems I am [12:56:06] Let me get back [13:15:54] Analytics-Tech-community-metrics: https://www.openhub.net/p/mediawiki stats out of date - https://phabricator.wikimedia.org/T96819#1268643 (Qgil) Open>stalled No reply and nothing I can do. [13:26:08] (PS2) Ottomata: Correct webrequest refinement bug about spiders not labelled correctly [analytics/refinery] - https://gerrit.wikimedia.org/r/207814 (owner: Joal) [13:26:15] (CR) Ottomata: [C: 2 V: 2] Correct webrequest refinement bug about spiders not labelled correctly [analytics/refinery] - https://gerrit.wikimedia.org/r/207814 (owner: Joal) [13:27:08] qchris_: Hi Sir [13:27:13] Do you have a minute for me ?
[13:29:17] joal: http://semver.org/ [13:32:21] Analytics-Tech-community-metrics, ECT-May-2015: Ensure that most basic Community Metrics are in place and how they are presented - https://phabricator.wikimedia.org/T94578#1268687 (Qgil) What about * Users who contributed a first patch per month. I think it is an important one, showing how are we doin... [13:32:52] gone ? [13:32:56] ottomata: --^ [13:32:57] ? [13:33:06] or is it me ? [13:34:11] gone again :) [13:37:37] (PS1) Ottomata: Update changelog.md in prep for v0.0.10 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/209473 [13:37:57] (CR) Ottomata: [C: 2 V: 2] Update changelog.md in prep for v0.0.10 release [analytics/refinery/source] - https://gerrit.wikimedia.org/r/209473 (owner: Ottomata) [14:01:15] ottomata: something else: can you add new versions of spark, hadoop and hive to archiva ? [14:05:00] oh, yes. hm, we should build against cdh 5.4 [14:05:01] HMMM [14:05:23] Sorry, this upgrade is heavier than what I thought ... [14:10:43] Analytics-Kanban, operations: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268864 (fgiunchedi) thanks for the report @milimetric! indeed the culprit was due to multiple interactions/renaming with: timers, simple counters and extended c... [14:11:20] Analytics-Kanban, operations: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268875 (fgiunchedi) p:Triage>Normal a:fgiunchedi [14:13:50] joal: 0.0.10 released, going to 0.0.11 with cdh 5.4 now... [14:13:56] will let you know [14:14:25] cool, thx ottomata [14:14:52] joal, I'm back [14:15:05] Hi mforns [14:15:09] batcave for a minute ?
[14:15:12] hellooo [14:15:13] sure [14:17:59] (PS1) Ottomata: Building version 0.0.11 against CDH 5.4.0 packages [analytics/refinery/source] - https://gerrit.wikimedia.org/r/209485 [14:18:39] (CR) Ottomata: [C: 2 V: 2] Building version 0.0.11 against CDH 5.4.0 packages [analytics/refinery/source] - https://gerrit.wikimedia.org/r/209485 (owner: Ottomata) [14:27:10] joal: 0.0.11 released. [14:28:15] oh, not yet. [14:28:16] sorry [14:28:17] still uploading [14:37:51] np [14:39:05] mforns: did you see graphite data is back? [14:39:07] http://graphite.wikimedia.org/render/?width=1059&height=509&_salt=1431009470.333&target=eventlogging.overall.raw.rate&target=eventlogging.overall.valid.rate&from=00%3A00_20150501&until=23%3A59_20150507 [14:39:14] that's a crazy jump yesterday [14:39:33] milimetric, no! [14:39:40] mforns, milimetric : I wonder if this is due to ottomata [14:40:10] Analytics-Kanban, operations: Event Logging data is not showing up in Graphite anymore since last week - https://phabricator.wikimedia.org/T98380#1268990 (Milimetric) Open>Resolved thanks very much @fgiunchedi, I'm not sure what that means but I'll try to get Andrew to translate :) [14:41:23] milimetric, joal, graphite-eventlogging has some strange metrics, that happened before [14:41:28] there are a lot of metrics now [14:41:54] milimetric: i restarted the hafnium consumer today [14:42:04] i saw that there hadn't been data since the 29th of april [14:42:13] and that was the last time the consumer was started [14:42:16] Analytics-Cluster, Analytics-Kanban: Report pageviews to the annual report - https://phabricator.wikimedia.org/T95573#1269007 (Milimetric) I thought we were going to try and do this with piwik instrumentation. That would be best since we're already trying to do way too many things at the same time. [14:42:17] what's bizarre ?
[14:42:18] so iunno, figured something was weird [14:42:20] so i just restarted it [14:42:22] milimetric: --^ [14:42:49] ottomata: no, that's not it [14:42:52] I had restarted it a few times [14:42:58] did you see the comment on that bug? [14:43:04] https://phabricator.wikimedia.org/T98380#1268864 [14:43:07] from Filippo [14:43:11] Analytics-Cluster, Patch-For-Review: Link refinery-source with cdh5.3.1 - https://phabricator.wikimedia.org/T93952#1269016 (Ottomata) Open>Resolved We are on 5.4.0 now [14:43:43] Analytics-Cluster, Analytics-Kanban: analytics/refinery/source unbuildable since commit that switched to CDH 5.3.1 packages - https://phabricator.wikimedia.org/T97278#1269018 (Ottomata) Open>Resolved I just blasted my .m2 on stat1002 and was able to build. Feel free to reopen if you still have prob... [14:44:18] Analytics-Cluster, operations, Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1269020 (Ottomata) Bump, @AndyRussG have you had a chance to look at these? [14:45:59] Analytics-Cluster, operations: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1269031 (Ottomata) [14:46:38] Hi ottomata! Thanks for following up on the phabricator task and for the reminder... :) I was ooo yesterday and the day before, but I'll have a look today for sure [14:47:57] Analytics-Cluster, Datasets-Archiving, Datasets-Webstatscollector, Patch-For-Review: Mediacounts stalled after 2015-04-26 - https://phabricator.wikimedia.org/T97753#1269036 (Ottomata) Open>Resolved Should be good now, thanks for the heads up!
[14:48:24] Analytics-Cluster, Patch-For-Review: Upgrade Analytics Cluster to CDH 5.4.0 - https://phabricator.wikimedia.org/T97453#1269039 (Ottomata) Open>Resolved [14:49:23] Analytics-Cluster: Import Mediawiki XML dumps into HDFS - https://phabricator.wikimedia.org/T76347#1269043 (Ottomata) Open>declined We might do this, but joal is kinda taking it on, and much work is happening in the Altiscale hadoop cluster. [14:51:30] joal: you gonna rerun that pagecounts all sites job? :) [14:51:33] i can do it if you like [14:51:58] ottomata: working with mforns currently, if you are at it, please go :) [14:52:04] If not, will do it in a bit [14:52:07] thanks AndyRussG, I won't be working much of next week, so it isn't a huge hurry, but it would be nice to be able to work on that when I get back [14:52:09] ottomata: --^ [14:52:11] ok joal doing [14:52:16] Thanks :) [14:52:38] ottomata: K u bet! thanks much :) [14:53:03] (PS1) Milimetric: Update for May meeting [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/209495 [14:53:22] (CR) Milimetric: [C: 2 V: 2] Update for May meeting [analytics/reportcard/data] - https://gerrit.wikimedia.org/r/209495 (owner: Milimetric) [15:30:38] (PS1) KartikMistry: Add languages deployed on 20150507 [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209501 [15:46:26] Quarry: JSON output should be one row per JSON blob - https://phabricator.wikimedia.org/T98492#1269220 (Halfak) NEW [15:49:03] (CR) KartikMistry: [C: 2] Add languages deployed on 20150507 [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209501 (owner: KartikMistry) [15:51:05] (Merged) jenkins-bot: Add languages deployed on 20150507 [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209501 (owner: KartikMistry) [15:55:03] (CR) Erik Zachte: [C: 2 V: 2] Add some more author aliases [analytics/wikistats] - https://gerrit.wikimedia.org/r/92069 (owner: Nemo bis) [15:56:35] wheee [16:02:44] milimetric: got a sec to look at
https://phabricator.wikimedia.org/T94271 ? trying to figure out if we can get pageview count (ish) data out of Hadoop(?) and into the production wiki DB [16:03:17] tgr: we don't want to let people rely on Hadoop for production data delivery yet [16:03:31] just for analysis and research right now, and for batched data crunching [16:03:40] is there an alternative? [16:03:51] batching would be fine [16:04:04] tgr, sorry, I'll get back to you - I should pay attention to this meeting I'm in [16:04:10] thanks [16:13:23] haha, milimetric, tgr, everybody wants scalable eventlogging + streaming! [16:20:55] milimetric: around now? [16:22:40] milimetric: ok. when you're free, take a look at language-reportcard.wmflabs.org and let me know where I messed up :) [16:31:09] kart_: you did nothing wrong, but that's too many tables to use in a join [16:31:10] Too many tables; MariaDB can only use 61 tables in a join [16:31:50] I can't argue with that, seems like a fairly sensible limit. So your only real option right now is to break up the graph into two graphs maybe? [16:32:00] and break up both queries into two queries [16:32:24] and make sure each query only has up to 60 tables. But, realistically, that graph doesn't make much sense with that many lines on it [16:32:51] milimetric: blah :) [16:33:02] milimetric: agree. [16:33:22] milimetric: I'll work on it next week then. [16:35:04] Analytics-EventLogging, Analytics-Kanban, Wikimedia-Search: Estimate maximum throughput of Schema:Search (capacity) {oryx} - https://phabricator.wikimedia.org/T89019#1269467 (Deskana) This schema is in production right now with a sampling rate of 0.1% and has been for several weeks. Not sure how that... [16:35:56] kart_: btw, after you do that, I'll have to change my script that runs it every night. So until then, you won't get any data, and you won't have any way to get the data back [16:36:08] kart_: would you like me to break it up?
Otherwise you'll lose the data [16:36:46] I'll brb, gotta get some lunch [16:36:47] milimetric: please. data is important. [16:37:19] milimetric: ping me, I just finished deployment and it was a crazy day for me :) [16:46:27] (CR) Erik Zachte: [C: 2 V: 2] Add perl dependencies to a CollectMailArchives.pl comment [analytics/wikistats] - https://gerrit.wikimedia.org/r/91998 (owner: Nemo bis) [16:54:26] joal: I was gonna work on https://phabricator.wikimedia.org/T88433 next. do you have some time to chat about it? [16:55:20] Hi madhuvishy, I am quite busy today ... but I encourage you to look at oozie's SLAs [16:55:25] It might solve the issue ? [16:55:27] not sure [16:56:17] joal: sure no problem. i'll do some reading and ask you tomorrow. I'm yet to understand how oozie works fully. [16:56:25] np :) [16:56:35] thanks [16:59:32] (CR) Erik Zachte: [C: 2 V: 2] Link tables from edit/reverts reports [analytics/wikistats] - https://gerrit.wikimedia.org/r/147175 (https://bugzilla.wikimedia.org/65992) (owner: Nemo bis) [17:25:29] Analytics-Wikistats, Tracking: Increase WikiStats reports discoverability - https://phabricator.wikimedia.org/T67991#1269559 (Nemo_bis) [17:25:30] Analytics-Wikistats, Patch-For-Review: Link tables from edit/reverts reports - https://phabricator.wikimedia.org/T67992#1269557 (Nemo_bis) Open>Resolved Thanks Erik for the merge! [17:30:58] (PS1) Joal: Bump refinery-core and refinery-hive jars version to 0.11.0, add referer_class to webrequest refine table and update webrequest refined recordVersion [analytics/refinery] - https://gerrit.wikimedia.org/r/209532 [17:58:25] ottomata, pokey :) [18:02:00] hiya [18:02:02] Ironholds: :) [18:02:18] ottomata, fluorine, mkdir /a/public-datasets/ plz? ;) [18:03:18] ? [18:03:54] wassup? [18:07:14] joal, yt? [18:10:01] hey mforns [18:10:15] hey joal, quick question about backfilling EL [18:10:27] sure [18:10:49] joal, so with the replace flag is it possible to avoid duplicating old events?
[18:10:58] events that already are in the db? [18:10:59] ottomata: you there ? [18:11:09] mforns: correct :0 [18:11:11] :) [18:11:30] If events are already there, mysql raises an error [18:11:44] the replace flag doesn't break the process in case of error [18:12:05] So in fact, it's not replace, it's prevent failing in case of already existing error [18:12:12] mforns: --^ [18:12:14] :) [18:12:20] joal, so, if there's a period with, let's say 50% of events that made it to the db, I can backfill the full event log, that the events in the log that already exist in the db won't be duplicated, and just those that were not there will be inserted [18:12:33] I see [18:12:55] joal: ja [18:12:57] in the car! [18:12:58] cool joal, thanks [18:12:59] phone internet is amazing! [18:13:03] mforns: That's why I think [18:13:12] ottomata: amazing indeed :)h [18:13:32] mforns: ifsome schema don'y make use of uuid, I can't be sure [18:13:43] joal, mmm [18:14:23] ottomata: I have a code review for you, but i'll probably be tomorrow I guess :) [18:14:47] naw i'm working [18:14:51] i'll look in a bit [18:14:57] thanks a lot [18:15:17] if we merge, I'll deploy and restart the refine job [18:16:01] joal, I remember that we changed EL db some time ago, for the new schemas to not use an id, just leave it for the old revisions. But I can not find a table without id now.. [18:16:16] hmm [18:16:29] joal, yes I found one [18:16:39] If so, that means I have inserted duplicates while backfilling ... [18:16:53] Nuria told me it was ok, so I didn't double check [18:16:59] milimetric: Have an idea on this ? [18:17:16] mforns: which ? [18:17:19] i'm watching the metrics meetring [18:17:24] you guys should too! [18:17:25] :) [18:17:28] :) [18:17:32] You are right ! [18:17:36] joal, MobileWebSearch_12054448 [18:17:45] milimetric, oh! [18:18:09] joal, let's watch? 
[18:18:36] yup [18:18:44] be back in a bit [18:21:38] mforns: on your example, uuid is PK [18:38:47] Analytics-Tech-community-metrics, ECT-May-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1269704 (sduenas) We have developed the first version of Maniphest backend. This means we're now ready to collect data. My idea is that it will last around 2 days and a half.... [18:41:08] milimetric: is there a calendar in which all these wmf events are? if so how do i get on it? [18:41:26] madhuvishy: I just grab it off Andrew's [18:41:34] milimetric: aah [18:41:37] but yes, it should be in one... let me look [18:44:20] Analytics-Tech-community-metrics, ECT-May-2015: Maniphest backend for Metrics Grimoire - https://phabricator.wikimedia.org/T96238#1269748 (chasemp) >>! In T96238#1269704, @sduenas wrote: > We have developed the first version of Maniphest backend. This means we're now ready to collect data. > > My idea is... [19:03:41] oo more internet! [19:06:34] (CR) Ottomata: [C: 2 V: 2] Bump refinery-core and refinery-hive jars version to 0.11.0, add referer_class to webrequest refine table and update webrequest refined reco [analytics/refinery] - https://gerrit.wikimedia.org/r/209532 (owner: Joal) [19:06:37] joal: merged. [19:20:36] ottomata, sorry, wasn't pinged [19:20:59] could you create /a/public-datasets/ or equivalent on fluorine so I can set up an rsync connector to datasets.wikimedia.org? [19:21:09] the search team needs to be able to surface search data, basically [19:21:47] grand :/ [19:35:24] milimetric: by stats endpoint do you mean exposing the ability to run hadoop queries to restbase? [19:36:14] tgr: no, it needs to be much faster than hadoop [19:36:18] like cassandra [19:36:29] or druid, plain postgresql, etc. [19:36:56] hadoop is only for huge data, this store would have pre-aggregated stuff [19:41:54] milimetric: but will it provide some ability to query aggregated results or just single urls? 
[19:42:47] Team, currently deploying new refinery on Hadoop [19:42:55] tgr, it should aggregate in multiple ways, at first by language, project, and article, and eventually by category, namespace maybe, etc. [19:43:19] but honestly right now Event Logging is proving a big handful and we must take care of it first [19:43:25] next is this pageview API [19:45:19] milimetric: I see, thanks [19:46:14] the curse of being an infrastructure team that has products to develop :/ [19:46:24] milimetric: +1 [20:28:37] stat1002 is kicking me out randomly again [20:30:34] Deployment done, double checking everything is fine [21:39:14] did Ottomata just take the afternoon off, orr [21:55:53] Analytics-Kanban: Split up language reportcard queries, data files, and graphs - https://phabricator.wikimedia.org/T98532#1270674 (Milimetric) NEW a:Milimetric [22:12:19] Ironholds: yeah he mentioned he'd be offline a bit today [22:12:34] just my luck [22:16:02] hellloooo [22:16:03] whoa slow internet [22:16:08] might have to find a cafe tomorro [22:16:10] w [22:16:11] Heya :) [22:16:16] good ol' appalachia [22:16:23] you've lonking after :) [22:16:30] ? [22:16:36] you've been looked after [22:16:41] pfff, late for me :) [22:16:43] oh? [22:16:46] indeed it is! [22:16:51] how goes? [22:16:53] Ironholds: ottomata is here :) [22:17:03] joal, yeah, got it resolved by someone else [22:17:03] not so good : queue issues [22:17:05] but thank you [22:17:11] np [22:17:20] queue issues? [22:17:38] ottomata: yup, bundle queue_name parameter doesn't work ... [22:17:50] -Dqueue_name=production [22:17:51] hmmmm no? for which job? refine? [22:17:55] yup [22:18:02] it isn't going to production? [22:18:06] paste me whole command?
https://gist.github.com/jobar/9f33686fd8d1697cb1d9 [22:20:00] from coord.properties: queue_name and oozie_launcher_queue_name defined hardcoded [22:21:01] hm, not related i think, but we want to revert that oozie queue name change i think [22:21:04] I was thinking relaunching with -Doozie_launcher_queue_name=production [22:21:08] i just set it to production right now [22:21:12] you should do both [22:21:13] right? [22:21:21] both ? [22:21:26] both queues [22:21:31] ok, will try [22:22:18] yeah, both queues, we added the oozie_launcher_queue_name and the default now is 'oozie' [22:22:24] but, the 'oozie' queue did not do well for us [22:22:41] but leaving the launchers in the same 'production' queue has been fine [22:22:51] but, we haven't reverted the oozie queue change [22:23:02] yup [22:23:04] we probably should, or at least set it to ${queue_name} by default [22:23:19] k--> working [22:23:29] I would agree :) [22:24:23] relaunched mobile uniques as well (madhu's change) [22:25:11] will check with misc that it works fine. then go to bed :) [22:26:38] ok thanks latenight joal! [22:26:40] youdabest [22:26:50] Ironholds: i don't understand the fluorine need [22:26:53] rmf :) [22:27:07] how does creating that directly get you data? [22:32:29] Fixed ! [22:32:37] G'night all ! [22:33:56] woot! [22:33:59] nighters [22:45:57] milimetric: the table that converts from en.wikipedia to enwiki - is it on hadoop? [22:50:19] k, goin for a walk before the sun goes down, ttyall tomorrow [22:52:30] madhuvishy: there's no table, I was suggesting we build one. There's just some python code that deals with the conversion [22:52:47] milimetric: aaah. [22:52:55] milimetric: where is this code?
[22:52:56] but it does so the other way, from the canonical database name to the corresponding hostname [22:53:04] one sec [22:55:19] madhuvishy: https://github.com/wikimedia/analytics-aggregator/blob/master/aggregator/util.py#L102 [22:56:38] the reverse (which is what this code is doing) is much easier and will not map to the weird hostnames, but it'll let you separate the expected hostnames from the weird ones [22:57:34] milimetric: hmmm [22:58:57] milimetric: is there a list of all these db names somewhere? (enwiki, dewiki, etc) [23:00:18] madhuvishy: https://github.com/wikimedia/operations-mediawiki-config all.dblist [23:01:58] yuvipanda: thanks! [23:02:55] madhuvishy: also, I forget how to access it but there's a way to get json from the site matrix: https://www.mediawiki.org/wiki/Special:SiteMatrix [23:03:31] oh, here's the python code in wikimetrics that does it: https://github.com/wikimedia/analytics-wikimetrics/blob/master/scripts/admin#L30 [23:05:39] milimetric: cool. but this is still not converting uri_host to this format right? i'm wondering what the general algorithm to do that is. [23:06:53] madhuvishy: I tried to implement that once, but it's fraught with peril. I have the code parked somewhere in gerrit. What I was trying to say above is that the reverse is much easier [23:07:07] so basically, start with all the dbs you know (enwiki, dewiktionary, etc.) [23:07:15] and add "mobile" etc. to them, and go backwards. [23:07:20] milimetric: of course. [23:07:32] you get hostnames, then you get a delta of all the crappy hostnames that don't match any known db [23:07:40] those you can just report on together, as "unknown" [23:07:47] or "weird" [23:07:53] or just ignore completely [23:09:02] milimetric: okay. i can try that. this will go into refinery right? I can write python? 
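The reverse-mapping algorithm milimetric describes above (generate hostnames from the known db names, including mobile variants, then treat anything that doesn't match as unknown) might be sketched like this. The suffix table below is a tiny illustrative subset, not the full project list; the real handling lives in analytics-aggregator's util.py and the wikimetrics admin script linked in the chat.

```python
# Illustrative subset: dbname suffix -> domain. The real set covers every
# project family (wikibooks, wikisource, wikivoyage, ...) plus special cases.
PROJECT_SUFFIXES = {
    "wiki": "wikipedia.org",
    "wiktionary": "wiktionary.org",
}

def hostnames_for_db(dbname):
    """Yield the hostnames a canonical db name implies (desktop + mobile)."""
    for suffix, domain in PROJECT_SUFFIXES.items():
        if dbname.endswith(suffix):
            lang = dbname[: -len(suffix)].replace("_", "-")
            yield "%s.%s" % (lang, domain)
            yield "%s.m.%s" % (lang, domain)  # the "add mobile" step
            break

def build_hostname_map(dbnames):
    """Invert the generated hostnames into a hostname -> dbname lookup."""
    return {host: db for db in dbnames for host in hostnames_for_db(db)}

mapping = build_hostname_map(["enwiki", "dewiktionary"])
hosts_seen = ["en.wikipedia.org", "en.m.wikipedia.org", "weird.host.org"]
# the delta milimetric mentions: hostnames matching no known db
unknown = [h for h in hosts_seen if h not in mapping]
```

Anything landing in `unknown` can be reported together as "unknown" or ignored, rather than trying to parse arbitrary crappy hostnames forward.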
[23:10:33] well, no, the python for it is already in refinery [23:10:56] if we use the analytics-aggregator to get this data out, then we don't need to re-do this logic [23:11:18] if we use something else, we'll have to see where to port this logic (Hive UDF, Spark, etc) [23:11:29] that's why I was saying not to do this for now [23:11:35] but I'm not sure what you're going for [23:11:57] milimetric: oh.. i don't know, i'm a bit jobless so was trying to figure it out [23:12:41] ah, that sucks, that's my bad now that Kevin's gone [23:12:54] milimetric: ha ha it's not like i don't have any tasks [23:12:56] um, when are you done for the day? [23:14:42] milimetric: this is left on my onboarding list. https://phabricator.wikimedia.org/T88433. was doing related reading which joseph suggested - but don't have clear ideas yet. [23:14:53] um, in an hour or 2? [23:16:44] madhuvishy: ok, I was trying to finish this terrible query splitting task [23:16:52] give me another 15 minutes and then let's talk for a bit? [23:16:59] I also filed this - https://phabricator.wikimedia.org/T98257. And thought the tables that map are already there - and I can change my reports query. [23:17:04] milimetric: sure! [23:33:42] madhuvishy: ok, batcave? [23:51:34] (PS1) Milimetric: Split up queries to avoid MariaDB limit [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209666 (https://phabricator.wikimedia.org/T98532) [23:52:02] (CR) Milimetric: [C: 2] "crossing my fingers that this will work tonight, if not I'll try again tomorrow." [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209666 (https://phabricator.wikimedia.org/T98532) (owner: Milimetric) [23:56:20] (PS1) Milimetric: Split up queries to avoid MariaDB limit [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209667 (https://phabricator.wikimedia.org/T98532)
[analytics/limn-language-data] - https://gerrit.wikimedia.org/r/209667 (https://phabricator.wikimedia.org/T98532) (owner: Milimetric) [23:58:03] I'm 33 years old and that is the worst SQL query I've ever written ^^ [23:58:04] ugh [23:58:05] god [23:58:10] * milimetric goes to hide [23:59:43] milimetric: awww. [23:59:47] * madhuvishy sends hugs
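The query splitting milimetric commits at the end (gerrit 209666/209667), reduced to its core, is just chunking the table list so no single generated query joins more than 60 tables, staying under MariaDB's 61-table join limit mentioned earlier. A minimal sketch, with hypothetical table names:

```python
# MariaDB refuses joins over 61 tables; stay one below for safety,
# as milimetric suggests ("make sure each query only has up to 60 tables").
MAX_JOIN_TABLES = 60

def split_tables(tables, limit=MAX_JOIN_TABLES):
    """Split a table list into chunks small enough to join in one query each."""
    return [tables[i:i + limit] for i in range(0, len(tables), limit)]

tables = ["lang_%03d" % i for i in range(130)]  # hypothetical per-language tables
chunks = split_tables(tables)
# one SELECT ... JOIN per chunk; results are merged afterwards
```

Each chunk then becomes its own query (and, per the chat, its own data file and graph), with the results recombined downstream.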