[01:18:41] (PS1) Milimetric: Clean up style and tweak UX [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387502 (https://phabricator.wikimedia.org/T178084)
[01:27:25] (PS2) Milimetric: Clean up style and tweak UX [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387502 (https://phabricator.wikimedia.org/T178084)
[08:39:04] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack and setup db1107 and db1108 - https://phabricator.wikimedia.org/T177405#3722441 (Marostegui) db1047's data has been migrated and imported to db1108
[08:47:22] morning!
[08:47:28] db1108 is finally ready!
[08:47:30] * elukey dances
[08:47:48] win 11
[08:47:51] argh
[08:52:48] Analytics, DBA, Operations: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3722449 (Marostegui) db1047's data has been migrated to db1108. It is working fine now. We are going to leave db1108 working for a few days, make sure the event logging syn...
[09:26:51] Hi elukey :)
[09:26:57] * joal dances with elukey :)
[12:57:17] (PS1) Fdans: Handle negative values in graphs [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387561 (https://phabricator.wikimedia.org/T178461)
[13:00:21] joal: qq brainbounce over irc if you have time
[13:00:30] Go ahead elukey :)
[13:00:32] :)
[13:01:06] so the druid prometheus collector needs to be a bit smart about how to collect data, since the way in which data is retrieved is a bit different
[13:01:30] instead of pushing metrics to some collector like statsd or graphite, we have to wait for the prometheus master to poll
[13:01:47] this means that for each metric, we need to expose only one datapoint
[13:02:07] milimetric: if we merge that change we're ready to deploy and announce :)
[13:02:49] for example, say we need to collect query/time for the historical and broker
[13:03:27] if I add a metric for each request latency logged, I end up with duplicates
[13:03:38] # TYPE druid_historical_query_time gauge
[13:03:38] druid_historical_query_time{datasource="test"} 15.0
[13:03:38] druid_historical_query_time{datasource="test"} 15.0
[13:04:17] (basically each prometheus exporter on the host offers a http interface that responds to GET /metrics)
[13:04:43] so the people doing the nginx exporter came up with a clever approach
[13:04:46] # TYPE druid_historical_query_time gauge
[13:04:49] druid_historical_query_time{datasource="test"} 15.0
[13:04:49] argh sorry
[13:04:58] nginx_http_request_duration_seconds_bucket{host="thumbor.svc.eqiad.wmnet",le="000.01"} 4108589
[13:05:01] nginx_http_request_duration_seconds_bucket{host="thumbor.svc.eqiad.wmnet",le="000.05"} 4537509
[13:05:04] nginx_http_request_duration_seconds_bucket{host="thumbor.svc.eqiad.wmnet",le="000.10"} 8008138
[13:05:07] nginx_http_request_duration_seconds_bucket{host="thumbor.svc.eqiad.wmnet",le="000.12"} 26359129
[13:05:30] the things in {} are prometheus labels, nothing super important, it is like having different metrics
[13:05:35] fdans: mmm, I wouldn't be so quick to judge, I was gonna keep going over it until I get that fuzzy feeling
[13:05:44] in this way we offer only a counter of "buckets"
[13:05:47] but you're welcome to merge and deploy if you like the change
[13:06:13] joal: I gotta shower and eat something, then I'm dying to hear your thoughts
[13:06:39] milimetric: I'm gonna take a break son
[13:06:42] soon sorry
[13:06:50] milimetric: Don't rush :)
[13:07:16] elukey: I don't get it :(
[13:08:16] elukey: trying to rephrase: We need to be smart === We need to aggregate data before it gets polled ?
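The bucket approach elukey describes can be sketched with the standard prometheus_client library: a Histogram keeps one cumulative counter per "le" bucket, so on each GET /metrics scrape the exporter exposes a single aggregated datapoint per label set instead of one sample per request. This is only an illustrative sketch, not the actual Druid exporter being written; the metric name, port and bucket boundaries are assumptions.

```python
# Illustrative sketch only: metric name, port and bucket boundaries are assumptions.
import random
import time

from prometheus_client import Histogram, start_http_server

# Latency buckets in milliseconds; the real list still has to be derived from
# observed query/time values, as discussed in the chat.
DRUID_QUERY_TIME_MS = Histogram(
    'druid_query_time_ms',
    'Druid query/time in milliseconds',
    ['service', 'datasource'],
    buckets=(5, 10, 25, 50, 100, 250, 500, 1000, 5000, float('inf')),
)

def observe_query_time(service, datasource, value_ms):
    # Observations only bump per-bucket counters; the scrape at GET /metrics then
    # sees one aggregated datapoint per label set and bucket, never duplicates.
    DRUID_QUERY_TIME_MS.labels(service=service, datasource=datasource).observe(value_ms)

if __name__ == '__main__':
    start_http_server(9100)  # the Prometheus master polls this endpoint
    while True:
        # Fake latencies, just to have something to scrape in the demo.
        observe_query_time('druid/historical', 'test', random.expovariate(1 / 50.0))
        time.sleep(0.1)
```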
[13:08:43] yes!
[13:09:01] Right
[13:09:09] so I was wondering what buckets we'd need
[13:09:35] hm - I get it now :)
[13:10:08] Since we need to aggregate on our side, defining buckets makes it really easier than maintaining a hell lot of values
[13:13:25] elukey: I actually have no clue - I think we'd need to analyse existing values before answering
[13:14:25] Event [{"feed":"metrics","timestamp":"2017-10-31T12:28:18.063Z","service":"druid/broker","host":"druid1004.eqiad.wmnet:8082","metric":"query/time","value":1,
[13:14:31] is that supersonic?
[13:14:48] from http://druid.io/docs/0.9.2/operations/metrics.html those should be ms
[13:15:44] elukey: There might be supersonic queries ?
[13:16:15] taking 1ms to complete? wow
[13:16:59] historical:
[13:16:59] Event [{"feed":"metrics","timestamp":"2017-10-31T10:20:37.552Z","service":"druid/historical","host":"druid1004.eqiad.wmnet:8083","metric":"query/time","value":40
[13:17:38] ah ok even Event [{"feed":"metrics","timestamp":"2017-10-31T12:53:57.417Z","service":"druid/historical","host":"druid1004.eqiad.wmnet:8083","metric":"query/time","value":261,
[13:17:40] elukey: there might be either cached queries, or wrong queries :)
[13:18:14] all right I'll try to come up with a list of buckets :)
[13:18:42] joal: about the metrics that we'd need - http://druid.io/docs/0.9.2/operations/metrics.html
[13:19:01] atm I have these
[13:19:30] 'query/time',
[13:19:30] 'query/bytes',
[13:19:30] 'query/node/time',
[13:19:30] 'query/node/bytes',
[13:19:30] 'query/cache/total/numEntries',
[13:19:32] 'query/cache/total/sizeBytes',
[13:19:33] Analytics, EventBus, MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), Services (done): PHP out of memory error trying to log big events - https://phabricator.wikimedia.org/T179280#3722998 (mobrovac) Open>Resolved
[13:19:35] 'query/cache/total/hits',
[13:19:37] 'query/cache/total/misses',
[13:19:40] 'query/cache/total/evictions',
[13:19:42] 'query/cache/total/timeouts',
[13:19:45] 'query/cache/total/errors',
[13:19:49] for all the daemons, but basically historical/broker
[13:20:00] hiii fyi stat1005 just disk space alerted? trying to check into it...
[13:20:21] ottomata: somebody is hammering the host, pretty sure..
[13:20:22] oh wait, maybe not disk space
[13:20:29] just alerts broken?
[13:20:46] yeah
[13:21:15] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=stat1005&refresh=1m&orgId=1
[13:21:43] I am unable to ssh
[13:21:55] trying to get into root console
[13:22:14] me too
[13:22:21] do you want to do it?
[13:22:24] lotsa oom killers
[13:22:38] ya i guess i'm already into it elukey you look busy with druid :)
[13:22:44] not having much luck getting in yet
[13:22:47] who could it be!
[13:22:56] nono sorry I thought you weren't logged in yet :)
[13:23:19] i'm in the console, but not able to log in as root because oom
[13:23:36] i'm in!
[13:23:37] i'm in!
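The aggregation step joal and elukey settle on above could fold the JSON events Druid emits (the "feed":"metrics" payloads quoted earlier) into a whitelist of Prometheus metrics, which the scrape then reads in aggregated form. Again a hedged sketch: only the fields visible in the quoted events are relied on, and the metric naming and bucket values are assumptions, not the eventual exporter.

```python
# Hedged sketch of folding Druid emitter events into exported metrics.
import json

from prometheus_client import Gauge, Histogram

QUERY_TIME_MS_BUCKETS = (10, 50, 100, 250, 500, 1000, 2500, 5000, 10000, float('inf'))

QUERY_TIME = Histogram('druid_query_time_ms', 'query/time in ms',
                       ['service'], buckets=QUERY_TIME_MS_BUCKETS)
CACHE_HITS = Gauge('druid_query_cache_total_hits', 'query/cache/total/hits',
                   ['service'])

def handle_event(raw_json):
    """Fold one Druid metric event into the exported state; ignore everything else."""
    event = json.loads(raw_json)
    if event.get('feed') != 'metrics':
        return
    name, service, value = event.get('metric'), event.get('service'), event.get('value')
    if name == 'query/time':
        QUERY_TIME.labels(service=service).observe(value)
    elif name == 'query/cache/total/hits':
        # Cache totals arrive as running snapshots, so a Gauge (set, not inc) fits better.
        CACHE_HITS.labels(service=service).set(value)
```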
[13:24:04] arnad
[13:25:19] :(
[13:25:21] gonna kill these
[13:25:50] joal: anyhow, if you want metrics from indexing/coordination/ingestion/real-time lemme know :)
[13:26:08] elukey: no need for realtime metrics as of today I think
[13:26:25] elukey: I was thinking you were after data-oriented buckets
[13:26:52] like: how do we bucket time for query/time
[13:26:53] oh yes but also overall metrics to return :)
[13:27:10] tried to batch questions in one go :D
[13:27:41] for the buckets I'll try to come up with a list, I wondered if you had one already in mind
[13:29:09] elukey: metrics list is fine for me
[13:29:26] elukey: buckets, I think we'd need to experiment
[13:29:57] elukey: sorry not to have a better answer
[13:30:14] elukey: I'm going to take my break now, ok for you?
[13:30:15] joal: you gave me two perfect answers, thanks :)
[13:30:34] I am going to be off later on, and also tomorrow
[13:30:40] so talk with you on Thursday :)
[13:39:37] (I'm at my desk but I don't wanna steal Jo from his break)
[13:40:49] ottomata: does puppet on stat1006 need to be disabled?
[13:41:58] no!
[13:42:00] is it?
[13:42:06] remembering why.....oh ya
[13:42:07] for aaron
[13:42:09] it should not be
[13:42:23] my fault that was (as always) supposed to just be for a second
[13:42:29] reenabled
[13:44:54] halfak: FYI your /srv/halfak dir from stat1003 is now on stat1006 at /srv/stat1003-srv/user_dirs_from_stat1003/halfak
[13:44:59] thanks! (I was just checking icinga alerts and noted it)
[13:45:33] Analytics-Kanban: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3723145 (Ottomata) I was able to boot up stat1003 and copy this over to stat1006. It is now on stat1006 at `/srv/stat1003-srv/user_dirs_from_stat1003/halfak`.
[13:45:39] Analytics-Kanban: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3723146 (Ottomata)
[13:45:45] Analytics-Kanban: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3716130 (Ottomata) p:Triage>Normal a:Ottomata
[13:47:27] Analytics-Kanban, Operations, ops-eqiad: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3723150 (Ottomata) Ok, I've saved halfak's data. This should be ok to wipe. @cmjohnson, stat1003 is still powered up. I can shut it down (`sudo poweroff`?), or let you do it. (I'm...
[13:52:39] a-team: going afk, today only half a day for me (and tomorrow is bank holiday in Italy)
[13:52:47] talk with you on Thur!
[13:52:50] have a nice break elukey
[13:52:52] thanks :)
[13:53:06] going to help my parents since they are closing their shop before pension
[13:53:16] so not really relaxing :D
[13:54:51] laters!
[13:54:58] elukey: what is their shop!?
[13:55:53] they are tobacconists !
[13:56:31] they also sell other things for school, toys, etc..
[13:56:42] one of the old shops having a lot of things to buy :)
[13:57:31] that sounds so awesome! !
[13:57:44] elukey: take some pics before it is all shut down!
[14:07:34] :)
[14:11:05] ah ottomata if you have time for https://gerrit.wikimedia.org/r/#/c/369902/ (wmde change)
[14:11:18] Adam submitted it but I forgot to review :(
[14:13:09] * elukey off!
[14:26:36] Analytics-Cluster, Analytics-Kanban, Language-Team, MediaWiki-extensions-UniversalLanguageSelector, and 3 others: Migrate table creation query to oozie for interlanguage links - https://phabricator.wikimedia.org/T170764#3723299 (Milimetric) @Amire80: any chance to review? I'd like to deploy the...
[14:34:00] Analytics-Cluster, Analytics-Kanban, Language-Team, MediaWiki-extensions-UniversalLanguageSelector, and 3 others: Migrate table creation query to oozie for interlanguage links - https://phabricator.wikimedia.org/T170764#3723316 (Amire80) Yes, very soon.
[14:43:14] Analytics-Kanban, Pageviews-API: Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3723354 (Milimetric) Verified that @MusikAnimal is right, the 404 responses no longer include a bunch of headers, including: access-control-allow-method...
[14:51:07] Analytics-Kanban, Analytics-Wikistats: Alpha release: Label breakdown categories and other non-obvious concepts - https://phabricator.wikimedia.org/T177950#3723416 (Milimetric) a:fdans>Milimetric
[14:56:32] Analytics, Contributors-Analysis: Move contents of ee-dashboards to edit-analysis.wmflabs.org - https://phabricator.wikimedia.org/T135174#3723468 (Deskana) p:Normal>Low
[14:57:13] Analytics, Contributors-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3723469 (Deskana) p:Triage>Low
[15:01:17] Analytics-Kanban, Analytics-Wikistats: Handle negative values in charts - https://phabricator.wikimedia.org/T178797#3702916 (Milimetric) a:Milimetric>fdans
[15:05:15] Analytics-Kanban: Fix mediawiki history page reconstruction bug - https://phabricator.wikimedia.org/T179074#3712458 (JAllemandou) a:JAllemandou
[15:18:28] Analytics: Update Mediawiki Table manuals on wiki - https://phabricator.wikimedia.org/T179407#3723665 (Milimetric)
[15:24:02] question about event logging and data volume. We will be running an AB test soon with much higher data volume than usual, probably about 800k events/day (split across 5 buckets). Not expecting all future tests to use this much, but with some analysis we've seen large event counts allow much more distinction between buckets, and would like to collect enough data to make better decisions
[15:24:08] about how much data is needed to measure particular effect sizes
[15:24:15] will this be a problem? Would we perhaps need to disable mysql import for that schema?
[15:29:52] ebernhardson: hm, so that's about 10 events per second, right?
[15:30:31] theoretically that's ok, and with the buffering we do, mysql should be able to keep up with insertion. But as far as space, I'm not sure
[15:30:54] to be safe, I'd blacklist it from mysql and analyze it in hadoop
[15:31:11] ebernhardson: and what do you mean by "how much data is needed to measure particular effect sizes"?
[15:31:31] joal: do you have a link to your gerrit spark 2 change?
[15:31:36] just want to look at what you had to change for spark 2
[15:33:36] milimetric: for example at https://people.wikimedia.org/~chelsyx/ctr_distribution.html#dewiki you can see that resampling from 10k sessions down to 1k sessions the distinction between the two groups becomes much more clear. That has an estimated effect size of 1.1% increase in CTR
[15:34:29] milimetric: the graph represents bootstrapped CTRs, so basically anything in the colored area could be the true CTR
[15:35:17] ebernhardson: that seems like strong evidence to me, I'm all for collecting the data you need to get your answers
[15:35:20] milimetric: and how likely a particular CTR is vs another is represented by height. Can see that at 5k there is no way you could say with much confidence that the blue is certainly better than the red, at 10k you sorta can but not with a 95%, or even 90% confidence
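The bootstrap comparison behind those plots can be approximated in a few lines: resample sessions with replacement, recompute CTR per group, and see how often one group beats the other. This is a rough, made-up-numbers stand-in for the linked R analysis; it only illustrates why small per-bucket samples leave the two distributions overlapping.

```python
# Rough, illustrative bootstrap; the real analysis is in R and more involved.
import numpy as np

def bootstrap_ctr(clicks, n_iter=1000, rng=None):
    """clicks: 0/1 per session. Returns n_iter bootstrapped click-through rates."""
    rng = rng or np.random.RandomState(0)
    clicks = np.asarray(clicks)
    n = len(clicks)
    return np.array([clicks[rng.randint(0, n, n)].mean() for _ in range(n_iter)])

rng = np.random.RandomState(42)
sessions = 10000                          # try 1000 vs 10000 vs 160000 per bucket
control = rng.binomial(1, 0.300, sessions)
test = rng.binomial(1, 0.303, sessions)   # hypothetical ~1% relative lift

ctr_control = bootstrap_ctr(control, rng=rng)
ctr_test = bootstrap_ctr(test, rng=rng)

# Fraction of resamples where the test bucket beats control; with small samples
# this hovers near 0.5, i.e. the colored areas in the linked plots overlap heavily.
print('P(test > control) ~= %.2f' % (ctr_test > ctr_control).mean())
```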
[15:39:42] ebernhardson: right, so yeah, an 800k sample size seems a lot bigger than 10k, did you want to try like 100k to see if that gets you close to the confidence you need, then ramp up from there if needed?
[15:40:33] wikimedia/mediawiki-extensions-EventLogging#708 (wmf/1.31.0-wmf.6 - d19bd44 : James D. Forrester): The build has errored.
[15:40:33] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/wmf/1.31.0-wmf.6
[15:40:33] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/295353381
[15:40:35] I gotta go cook lunch now, but let me know if disabling mysql insertion is a pain for you
[15:40:36] this test will have 5 buckets, i suppose i could limit it down to fewer buckets, but was running a normal test we planned to run at the same time. 800k split to 5 groups gives ~160k per bucket, which is certainly still much bigger than before
[15:40:51] ebernhardson: right, I see. Yeah, then good
[15:41:07] ebernhardson: so is it a pain for you to disable mysql insertion?
[15:41:27] milimetric: probably, because we have pre-built analysis that can just be run but it's all based on mysql sourcing.
[15:41:39] gotcha
[15:41:47] but i think a lot of the actual processing happens in R, with the data just sourced from mysql, so maybe it's not too big a pain to query out of hive into a CSV
[15:41:52] ebernhardson: do you know if it'll be 10/second or more bursty than that?
[15:41:52] i'll double check with chelsy
[15:42:34] milimetric: more bursty, i don't have exact numbers but probably more like 20 or 30/s at peak
[15:42:39] ebernhardson, cc chelsyx: it's not as easy as a query from Hive, you'd have to first make a Hive table
[15:43:01] 30/s at peak should still be ok for mysql, and we can blacklist it as it happens too
[15:43:07] but it is a bit much
[15:43:16] i'm working a little bit on productionizing eventlogging hive refinement now
[15:43:23] sorry it is not 100% avail yet
[15:43:40] ebernhardson: has spark foo, it is pretty easy to query from spark :)
[15:43:43] ottomata: what do you think about 30/s going into mysql now? Is that rickety thing gonna handle it?
[15:43:52] i think elukey has more info on that than I
[15:43:57] they are setting up new instances
[15:43:59] yeah, hm...
[15:44:11] but my opinion is to not rely on mysql unless you have legacy stuff where you have to
[15:44:26] especially since we want to deprecate the whole thing
[15:45:21] ok, ebernhardson: if it's a big pain to switch to spark to pull data out (maybe into Hive and then use the same SQL to query), then I'd say EL can handle another 30/s
[15:45:32] it's at like 300/s now, and a single schema (popups) is doing like 280/s
[15:45:46] so it wouldn't bring it up to insane levels to add 30/s
[15:46:36] ok, thanks! and it's no worry if you need to disable it i'm sure i can get the data into a csv with spark, not sure how easy/hard that will be for getting into R but probably not a monumental task
[15:46:44] but I agree with ottomata, if you can put it in Hadoop only, it'll help you get up to speed with how this will work in the future. We're not keeping around mysql access after next quarter
[15:46:56] wow, that's quick!
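The "query it out of Hive into a CSV with Spark" route ebernhardson mentions could look roughly like this once the schema has been refined into a Hive table. Everything named here (database, table, the fields under the event struct, output file) is an assumption for illustration; the refined tables were not in their final shape when this was discussed.

```python
# Hedged sketch: database, table and field names are illustrative assumptions.
# Run with pyspark (Spark 1.6 here, hence HiveContext) on an analytics client host.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='el-ab-test-to-csv')
sqlContext = HiveContext(sc)

df = sqlContext.sql("""
    SELECT event.subTest         AS bucket,
           event.searchSessionId AS session_id,
           event.action          AS action
    FROM event.TestSearchSatisfaction2_16909631  -- hypothetical refined table
    WHERE year = 2017 AND month = 11
""")

# A few hundred thousand rows fit comfortably on the driver, so the simplest
# hand-off to R is to collect and let pandas write a plain CSV.
df.toPandas().to_csv('ab_test_events.csv', index=False)
```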
[15:47:00] yea
[15:47:05] it's really dying
[15:47:11] we're just sitting next to it with a firehose all the time
[15:50:02] (CR) Fdans: [C: 2] Clean up style and tweak UX [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387502 (https://phabricator.wikimedia.org/T178084) (owner: Milimetric)
[16:00:56] (PS2) Fdans: Handle negative values in graphs [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387561 (https://phabricator.wikimedia.org/T178461)
[16:10:20] Hey ottomata - I do have a link for spark 2 change (you might have found it, but in case): https://gerrit.wikimedia.org/r/#/c/348207/
[16:11:11] i just found!
[16:11:25] joal: what do you think i should do with this refine stuff related to spark 2?
[16:11:43] we had talked about getting spark 2 productionized first and then porting it to spark 2?
[16:11:53] if we do that, we should also probably do the same with mforns' druid code
[16:11:59] ottomata: You currently have it for spark 1?
[16:12:07] 1.6 ya
[16:12:34] ottomata: Then I don't mind deploying it as is
[16:12:34] following
[16:13:16] ottomata: It'd be better to have it with 2 for various reasons, but it means productionisation first
[16:13:35] And I'd rather test productionisation of spark 2 with an existing job whose changes are minimal
[16:13:42] Like a wikidata one for instance
[16:14:15] hmm, ok
[16:14:19] yeah, and
[16:14:30] getting spark 2 'productionized' does not mean 1.6 won't be usable anymore
[16:14:31] ok
[16:14:35] cool i agree
[16:15:16] ottomata: the only thing making me sad in that decision (even if it's the good one), is that we don't push toward 2 :(
[16:22:54] joal: ok here is a brainbouncy q for you
[16:23:38] the step where i actually insert the data into the hive table
[16:23:40] is this
[16:23:45] mergedSchemaDf.write.mode("overwrite")
[16:23:45] .partitionBy(partitionNames:_*)
[16:23:45] .insertInto(partition.tableName)
[16:23:55] i'm logging before and after this statement, so i know exactly what partition.tableName is
[16:24:07] in this case
[16:24:08] otto.CentralNoticeBannerHistory
[16:24:20] however, something along the way in the spark/hive code
[16:24:23] prints out this error:
[16:24:28] ERROR Hive: Table otto not found: default.otto table not found
[16:25:00] i know this is in the spark/hive code as there is nothing happening in between in my code
[16:25:04] and dataframe.write is all spark
[16:25:10] so
[16:25:12] i guess my q is
[16:25:23] do we care? i was going to fix this one last bug before we merge and productionize
[16:25:31] but i'm beginning to think it is a harmless spark bug somewhere
[16:25:39] been googling around, haven't found anything related
[16:26:12] ottomata: For me it's related to using 'insertInto' with hiveContext
[16:26:48] ottomata: Would you mind trying to write parquet not in a hive table, but as files (to double check)
[16:29:12] joal: i'm pretty sure that would get rid of the error, since it is somewhere in the spark/hive code
[16:29:30] in ERROR Hive
[16:29:32] 'Hive'
[16:29:36] is the name of the logger
[16:31:28] ottomata: That's weird - I have no clue why it comes here - So if it doesn't cause any trouble, I'd say let's forget about it
[16:34:40] ok cool
[16:34:58] joal: i'm running a rebased refine on eventlogging data now, for the last 24 hours
[16:35:02] all schemas
[16:35:04] in hadoop
[16:35:15] if all goes well, would you +1 or +2 this?
[16:35:21] then i'll get to making some productionizing patches
[16:35:25] I would definitely :)
[16:35:39] * joal loves the idea of having EL data in hadoop :)
[16:35:59] so, i remember when i tried to run this unattended regularly in a cron before, it had some troubles
[16:36:03] don't remember what they were
[16:36:10] and i couldn't reproduce in a one off CLI session
[16:36:10] ottomata: Do you recall which kind?
[16:36:14] no not at all
[16:36:16] :)
[16:36:19] Let's try
[16:36:20] tricky spark memory ones maybe
[16:36:28] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [analytics/aggregator] - https://gerrit.wikimedia.org/r/387606
[16:36:30] possible ottomata
[16:36:30] maybe cluster resource contention? who knows
[16:36:31] but ya
[16:36:35] let's try
[16:36:39] ottomata: what settings do you use?
[16:36:40] (CR) Hashar: "check experimental" [analytics/aggregator] - https://gerrit.wikimedia.org/r/387606 (owner: Hashar)
[16:36:49] was using auto scale
[16:36:56] but also, i should probably limit parallelization of jobs
[16:37:04] i think it defaults to num CPUs on app master
[16:37:04] (CR) jerkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [analytics/aggregator] - https://gerrit.wikimedia.org/r/387606 (owner: Hashar)
[16:37:11] (CR) jerkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [analytics/aggregator] - https://gerrit.wikimedia.org/r/387606 (owner: Hashar)
[16:37:15] so it'll try to launch what, 48 spark jobs at once? :p
[16:37:19] https://yarn.wikimedia.org/proxy/application_1504006918778_235514/
[16:37:21] hmm doesn't look so bad
[16:37:23] maybe 8?
[16:37:47] hmm but executors show
[16:37:48] shows
[16:37:49] active tasks
[16:37:52] https://yarn.wikimedia.org/proxy/application_1504006918778_235514/
[16:37:53] 38
[16:37:54] ottomata: The thing is, all those jobs will share Spark Driver resource -
[16:38:09] yes given to it by yarn, right?
[16:38:19] (CR) Hashar: "check experimental" [analytics/quarry/web] - https://gerrit.wikimedia.org/r/370606 (owner: Quiddity)
[16:38:29] driver will be on yarn yes - You should give it enough ram
[16:38:37] hmmm oh ya maybe that is it
[16:38:38] ottomata: --driver-memory 8G
[16:38:40] (CR) jenkins-bot: fix name and link of host from labs [analytics/quarry/web] - https://gerrit.wikimedia.org/r/370606 (owner: Quiddity)
[16:38:47] more driver mem + maybe fewer parallel
[16:38:53] and restrict parallelization to 8
[16:39:01] we don't need to refine 48 at once
[16:39:01] yeah
[16:39:09] ok
[16:39:27] For workers, I also suggest you use --executor-cores 3 --executor-memory 6G
[16:39:27] (CR) Hashar: "check experimental" [analytics/multimedia] - https://gerrit.wikimedia.org/r/324379 (https://phabricator.wikimedia.org/T98449) (owner: Gergő Tisza)
[16:39:36] ottomata: --^
[16:39:45] (CR) jenkins-bot: Fix SQL queries [analytics/multimedia] - https://gerrit.wikimedia.org/r/324379 (https://phabricator.wikimedia.org/T98449) (owner: Gergő Tisza)
[16:39:47] (CR) Hashar: "check experimental" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371964 (owner: Bearloga)
[16:40:20] and ottomata, also, limit max number of executors in dynamic allocation: --conf spark.dynamicAllocation.maxExecutors=16
[16:40:53] ok
[16:40:57] ottomata: - actually you could go for more executors: 32
[16:41:07] or even 64
[16:41:22] better to make it static like that? rather than let it be dynamic?
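Pulled together, the settings joal suggests (plus his clarification just below that maxExecutors only caps dynamic allocation) would end up on the spark-submit command line roughly like this. The launcher, class name and jar are hypothetical; only the flag values come from the discussion above.

```python
# Hypothetical launcher sketch; only the resource flags are taken from the chat.
import subprocess

cmd = [
    'spark-submit',
    '--master', 'yarn',
    '--driver-memory', '8G',        # the driver is shared by all parallel refine jobs
    '--executor-cores', '3',
    '--executor-memory', '6G',
    # Cap for dynamic allocation, not a static executor count.
    '--conf', 'spark.dynamicAllocation.maxExecutors=64',
    '--class', 'org.wikimedia.analytics.refinery.job.JsonRefine',  # assumed class name
    'refinery-job.jar',
    # ...job arguments (input base path, output database, since/until, etc.) would follow
]
subprocess.check_call(cmd)
```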
[16:41:47] (CR) jenkins-bot: README.md: update link [analytics/reportupdater] - https://gerrit.wikimedia.org/r/371964 (owner: Bearloga)
[16:41:51] ottomata: this setting (--conf spark.dynamicAllocation.maxExecutors=XX) is about maximum number of executors in dynamic mode
[16:42:11] So it just means you won't eat all resources even if you could ;)
[16:42:34] ohoh right
[16:42:35] k
[16:42:43] gr8
[16:43:13] ebernhardson milimetric: Sorry I'm late for the party. No, it shouldn't be very hard to get data from hadoop into R, but it will take a long time for R to process using our prebuilt analytics code
[16:43:19] Given there should be maybe small jobs, using multicores could help
[17:01:02] a-team: we doin' staff?
[17:01:09] ops coming
[17:01:16] oh, cancelled
[17:01:42] so canceled we could call it house of cards
[17:01:56] ha
[17:02:02] ok, I'll look at your patch next fdans
[17:02:09] thank youuuu
[17:02:11] it must be another bug I didn't see
[17:03:46] Analytics, Proton, Readers-Web-Backlog, MW-1.31-release-notes (WMF-deploy-2017-10-31 (1.31.0-wmf.6)), and 2 others: Implement Schema:Print purging strategy - https://phabricator.wikimedia.org/T175395#3724064 (Jdlrobson) https://gerrit.wikimedia.org/r/386911 is on the train and will not be deploye...
[17:25:22] gone for dinner a-team, will be back after
[17:30:50] joal: refine succeeded with your settings
[17:30:57] took a while to finish with popups, since it was bigger
[17:31:22] also ChangesListHighlights looks like it has some incompatible field changes in it
[17:31:27] but that's ok, we can just skip those ones
[17:31:57] (PS42) Ottomata: JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[17:32:20] (CR) Milimetric: [C: 2] Handle negative values in graphs [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387561 (https://phabricator.wikimedia.org/T178461) (owner: Fdans)
[17:32:34] joal: when you get back, gimme pluses on https://gerrit.wikimedia.org/r/#/c/346291/! :)
[17:34:37] (CR) jerkins-bot: [V: -1] JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[17:39:00] Analytics, Analytics-Wikistats: Deal with zero divisions in "small metrics" - https://phabricator.wikimedia.org/T179424#3724156 (fdans)
[17:42:19] (PS1) Fdans: Release 2.0.10 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387617
[17:46:05] (CR) Fdans: [C: 2] Release 2.0.10 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/387617 (owner: Fdans)
[17:50:17] (PS1) Fdans: Release 2.0.10 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/387621
[17:56:34] (CR) Fdans: [C: 2] Release 2.0.10 [analytics/wikistats2] (release) - https://gerrit.wikimedia.org/r/387621 (owner: Fdans)
[18:20:28] mforns: i wrote https://www.mediawiki.org/wiki/Reading/Web/Advice_when_writing_EventLogging_schemas
[18:20:38] and added an edit on the linked page on wikitech you gave me
[18:20:41] anything further to add?
[18:57:38] jdlrobson, looking
[19:00:19] holaaaa
[19:00:25] back online, sorry everyone
[19:01:16] jdlrobson, thanks for your edits, the "More than one schema?" part is great! Thinking about it now, if the schema is not sampled, it might be useless to split it, unless it is very high throughput, because we have the timestamp field that can easily link the two split parts of the original schema
[19:01:41] luckily, the sampling guarantees that timestamp field is not linking
[19:03:06] well... the sampling should be done separately: different sampling user choices for both schemas
[19:03:17] hey nuria_ :]
[19:16:30] mforns: hola
[19:17:05] ebernhardson: ottomata mforns : since we agree on moving the AB testing that ebernhardson is doing to hadoop
[19:17:38] mforns: we can add it to the preliminary refine crons we have so he can extract the data more easily right?
[19:18:00] i know ebernhardson knows the ways of spark but refining here will make more sense, ideas? cc ottomata
[19:18:10] nuria_, which table would that be
[19:18:12] ?
[19:18:13] ya agree
[19:18:29] ottomata: well, once ebernhardson has schema we can name table
[19:18:31] ottomata: I'd give you pluses, but jenkins doesn't
[19:18:36] who knows, maybe by next week we'll have productionized refine for all/most schemas! :o
[19:18:44] :D
[19:18:45] jenkins is stupid
[19:18:46] O.o
[19:18:53] you + me i'll deal with JENKINS
[19:19:01] * joal loves mforns with big eyes
[19:19:13] mforns: just like popups goes into popups table this can go in whatever pretty-name tbl
[19:19:20] k
[19:26:08] (CR) Joal: [C: 2] "LGTM !" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[19:26:21] ottomata: I let you tame jenkins :)
[19:27:38] :) thank you!
[19:27:57] (CR) jerkins-bot: [V: -1] JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/346291 (https://phabricator.wikimedia.org/T161924) (owner: Joal)
[20:07:57] ebernhardson: let us know when you are back
[20:10:01] nuria_: oh i missed above. The table is TestSearchSatisfaction2_16909631. Test will probably run tomorrow afternoon for 7 days
[20:10:29] ebernhardson: let's blacklist it on mysql, right? cc mforns ottomata
[20:10:40] * ebernhardson knows that is the worst name ever for a schema ... but we never bothered fixing it after understanding how EL works
[20:10:58] * nuria_ was going to let that one pass
[20:11:09] :)
[20:11:17] mforns: , ottomata confirm that you are ok blacklisting this please
[20:11:48] +1
[20:12:58] ebernhardson: ok, give me a sec
[20:16:30] Analytics-Cluster, Analytics-Kanban: Make Spark 2.1 easily available on new CDH5.10 cluster - https://phabricator.wikimedia.org/T158334#3724588 (Ottomata) a:Ottomata
[20:19:05] ebernhardson: https://gerrit.wikimedia.org/r/#/c/387654/
[20:19:45] ottomata to merge
[20:20:05] ebernhardson: as soon as it is merged data will no longer show up on mysql, makes sense?
[20:22:44] Analytics-Kanban: Locate data from /srv on stat1003 - https://phabricator.wikimedia.org/T179189#3724596 (Halfak) Confirmed. I'm pulling data off of it. Thanks for the help!
[20:23:04] nuria_: i was just thinking about that, and how it's going to break our dashboarding since that also sources from mysql. Better solution: I'll make it so normal data goes to that table, but anyone in this test logs to 'SearchSatisfaction_17374996', which has exact same schema (but different name/revision). While the dashboarding does need to be updated, it will be something that can be
[20:23:11] prioritized instead of just broken. Also gives me an opportunity to fix this horrible name :)
[20:23:46] ebernhardson: sounds great, let me amend commit
[20:24:16] o/ nuria_
[20:24:32] Would you like me to resolve tasks like https://phabricator.wikimedia.org/T179189 myself?
[20:24:33] halfak: hola
[20:24:38] or wait for one of analytics to do it?
[20:24:56] halfak: i normally do it in batches
[20:25:03] Ok I'll leave it alone then :)
[20:25:04] Thanks
[20:25:12] * halfak likes to batch them too
[20:25:15] halfak: cause i like to look at our code and normally cannot do it realtime
[20:25:31] halfak: in this case it does not matter but just so you know
[20:29:34] nuria_: i think you committed something funky against an old version
[20:29:35] rebase conflict
[20:29:38] with Popups
[20:29:41] can you rebase and resubmit?
[20:30:35] ottomata: ah sorry i thought i did that
[20:34:37] ottomata: hopefully fixed
[20:37:02] nuria_: can you handle restarting el mysql to pick up change?
[20:37:11] ottomata: yessir
[20:37:16] k one min...
[20:38:53] ottomata: k you let me know
[20:39:07] nuria_: go ahead
[20:39:17] ottomata: ok, let me pull out graphs
[20:44:50] ottomata: killing that process kills it out cold
[20:45:03] ottomata: it does not empty memory queue
[20:45:08] ottomata: anyways...
[20:46:05] does not empty memory queue?
[20:46:16] oh, how are you killing it?
[20:48:31] FYI a-team, the new editing metrics are live on Wikistats 2! https://stats.wikimedia.org/v2/
[20:49:24] Yay fdans !!!!
[20:52:12] fdans, woohooo!
[20:54:36] ottomata: i am killing it with upstart
[20:54:40] ottomata: so nice kill
[20:55:06] ottomata: but i would expect to see some logging that i do not see
[20:55:19] fdans: WHATATA
[20:55:24] fdans: SO CLOSE!!!!!!
[20:56:07] joal: did you get anything new from erik's work?
[20:56:29] nuria_: nope, but got news from me working
[20:56:36] joal: ahahaha
[20:56:49] joal: anything you want to share
[20:57:36] nuria_: We've been mistaken in how we work restores in pages
[20:58:07] joal: overcounting and thus reducing deletions? (or the opposite)
[20:58:32] and thus increasing new article creation ?
[20:58:37] or not that suimple?
[20:58:41] *simple
[20:59:24] nuria_: actually, we were expecting a deletion before a restore, while restores can happen with existing pages (restore old revisions on an existing page)
[20:59:40] joal: aham
[21:00:01] joal: worth documenting , right?
[21:00:13] nuria_: The problem in the way we work is that it breaks lineage of page histories, therefore preventing correct history denormalization
[21:00:22] nuria_: milimetric has a task created for that
[21:01:01] joal: that seems like it would be different from the issue in which before 2014 our metrics differ more , right?
[21:01:11] joal: than earlier wikistats ones that is
[21:01:41] not sure nuria_
[21:01:49] 2014 was related to deletions
[21:02:43] joal: aham
[21:10:38] joal: so we will correct known bugs, rerun snapshot and recompare?
[21:11:53] nuria_: that's the plan
[21:12:13] joal: ok, getting closer!
[21:12:41] nuria_: but I'm hitting a difficult aspect now
[21:13:00] joal: aham?
[21:15:01] nuria_: I'm double checking that the new behaviour for restores also works for special cases we covered in the previous way of dealing with it
[21:15:04] :s
[21:15:06] regression checks
[21:15:15] joal: puf
[21:28:32] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Make Spark 2.1 easily available on new CDH5.10 cluster - https://phabricator.wikimedia.org/T158334#3724766 (Ottomata) OOooOOO boy! ``` [@stat1005:/home/otto] $ ls /usr/bin/*spark2* | cat /usr/bin/pyspark2 /usr/bin/spark2-beeline /usr/bin/spark2R...
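joal's finding about restores above could be sanity-checked against the sqooped logging table with something along these lines: count restore events that have no earlier deletion of the same page title. Table and partition names are assumptions (wmf_raw.mediawiki_logging with a snapshot partition, and a HiveContext as sqlContext); the real history reconstruction code is of course more careful than this.

```python
# Hedged sketch, assuming a HiveContext as `sqlContext` and the sqooped logging table.
restores = sqlContext.sql("""
    SELECT r.wiki_db,
           COUNT(*) AS restores,
           SUM(CASE WHEN d.first_delete_ts IS NULL
                      OR d.first_delete_ts > r.log_timestamp
                    THEN 1 ELSE 0 END) AS restores_without_prior_delete
    FROM wmf_raw.mediawiki_logging r
    LEFT JOIN (
        SELECT wiki_db, log_namespace, log_title,
               MIN(log_timestamp) AS first_delete_ts
        FROM wmf_raw.mediawiki_logging
        WHERE snapshot = '2017-10' AND log_type = 'delete' AND log_action = 'delete'
        GROUP BY wiki_db, log_namespace, log_title
    ) d ON d.wiki_db = r.wiki_db
       AND d.log_namespace = r.log_namespace
       AND d.log_title = r.log_title
    WHERE r.snapshot = '2017-10'
      AND r.log_type = 'delete' AND r.log_action = 'restore'
    GROUP BY r.wiki_db
""")
restores.show()
```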
[21:44:42] nice !
[21:48:33] pff - I might have spoken too fast - While it feels reasonable to have partial restores, I really can't find any way to do it - I might actually be wrong
[22:25:50] Analytics, Analytics-Wikistats: Wikistats Bug – In tabular data view, format displayed values - https://phabricator.wikimedia.org/T179441#3724893 (Jdforrester-WMF)
[22:28:29] Analytics, Analytics-Wikistats: Wikistats Bug – Don't display time range difference over all time - https://phabricator.wikimedia.org/T179443#3724926 (Jdforrester-WMF)
[22:36:11] Analytics, Analytics-Wikistats: Wikistats Bug – Put view settings in URL so it can be shared - https://phabricator.wikimedia.org/T179444#3724954 (Jdforrester-WMF)
[22:49:30] milimetric: question if you may?
[22:55:34] milimetric: I'll need help tomorrow - I don't find my way outa here :)
[22:55:37] Gone for tonight
[22:56:43] nuria_: hey
[23:56:42] probably a bit late but from stat1005: ls: cannot access '/mnt/hdfs': Transport endpoint is not connected
[23:56:47] (PS2) Mforns: [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[23:57:01] i'm out for the day so i'll just check into it tomorrow
[23:58:59] (CR) jerkins-bot: [V: -1] [WIP] Add scala-spark core class and job to import data sets to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)