[03:49:50] 10Analytics, 10Analytics-EventLogging, 10AbuseFilter, 10CirrusSearch, and 30 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3595275 (10Nuria) [04:13:32] 10Analytics-Kanban, 10Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3595282 (10Nuria) @JAllemandou let's please document this dataset throughly once available, i will add it to the goals of next quarter for visibility [07:55:12] 10Quarry: Provide a way to hyperlink Quarry results/output - https://phabricator.wikimedia.org/T74874#771075 (10Tgr) Ideally this would be defined in the query code, e.g. ```lang=sql SELECT page_namespace, page_title, -- @quarry:format page_title el_to, -- @quarry:format url gt_lat, gt_lon, -- @quarry:f... [07:59:56] 10Quarry: Make query URLs have a sluggified version of the title in them - https://phabricator.wikimedia.org/T75885#784350 (10Tgr) It doesn't sound too expensive to just generate the slug from the title on the fly, without storing any new data. [08:04:36] 10Quarry: Allow comments on queries - https://phabricator.wikimedia.org/T71543#720495 (10Tgr) Just export query description and SQL to a wiki page (with a Flow talk page), subscribe query owner to it? Not too much effort and would also solve {T90509}. [08:06:59] 10Quarry: Have an easy way to ban users from Quarry - https://phabricator.wikimedia.org/T104322#1413399 (10Tgr) If the concern is abusive users (as oppose to well-intentioned but clueless ones), just block them on meta and rely on the OAuth block data? [08:11:06] morning! [08:11:15] so the eventlogging_cleaner script seems not proceeding very well [08:11:21] db1047 is again overloaded [08:11:32] I tried to stop mysql replication for wikis + stop eventlogging_sync [08:11:48] load went down but it takes minutes to update ~10k rows [08:12:36] I am not sure what is the best way to proceed.. [08:12:48] for sure it would not be a problem with the new hardware [08:13:05] I am starting to think that it might be wiser to wait for the new hardware replacements [09:24:16] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jul-Sep 2017): Automatically sync mediawiki-identities/wikimedia-affiliations.json DB dump file with the data available on wikimedia.biterg.io - https://phabricator.wikimedia.org/T157898#3595645 (10Aklapper) Upstreamed as https://gitlab.com/Bitergia/c/... [09:25:37] 10Quarry: Provide a way to hyperlink Quarry results/output - https://phabricator.wikimedia.org/T74874#3595648 (10Tgr) As a very simple first step, fields that match an URL regexp could be linkified. [09:46:31] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3595705 (10mobrovac) [09:46:34] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3595700 (10mobrovac) 05Open>03Resolved Everything is set up now, and the `cpjobqueue` service is live in production on t... [10:50:07] (03CR) 10Joal: "Setup seems correct, doesn't it? 
:)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/376797 (https://phabricator.wikimedia.org/T174796) (owner: 10Shilad Sen) [10:53:51] (03PS2) 10Joal: Correct oozie jobs loading pageviews in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/375766 (https://phabricator.wikimedia.org/T161824) [10:53:59] (03CR) 10Joal: [V: 032] Correct oozie jobs loading pageviews in druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/375766 (https://phabricator.wikimedia.org/T161824) (owner: 10Joal) [10:56:15] * elukey waves to joal [10:57:17] * elukey lunch! [11:25:12] taking a break a-team [12:09:29] helloooooo team :] [12:17:01] mforns: o/ [12:17:10] hello elukey :] [12:20:12] mforns: the eventlogging_cleaner script works fine but it tend to trash the analytics slaves [12:20:27] elukey, aha [12:20:38] maybe we can increase the sleep time, now that it's faster? [12:20:39] I tried different combinations but at some point the disk saturation rises to nearly 100 [12:20:52] oh... mmmh [12:20:53] I tried, now it is running with 10k batches + 4s of delay [12:21:08] it works like a charme for a bit then the dbs are not fine [12:21:13] I see [12:21:31] how much time does it take to trash them? [12:21:40] but I am pretty sure this is mostly related to the fact that those hosts are super old, without ssds etc.. [12:22:08] it depends on the load of the db, like normal queries, eventlogging_sync, etc.. [12:22:09] and what does it take to restore the slaves? is it just waiting? [12:22:29] they keep going but with disk saturated everything takes ages to complete [12:22:38] like an update query takes minutes rather than seconds [12:22:56] I see [12:23:10] I can try to lower down again the batch size to say 1000 [12:23:16] and 5s of delay [12:23:26] but then we'll not make it in time for September [12:23:30] elukey, why is the disk being saturated? [12:23:49] is it temporary data? [12:24:03] mforns: still not sure, but updates are heavy on dbs, and we have super old hosts [12:24:13] I am pretty sure that with then new hw we wouldn't see this mess [12:24:18] aha [12:24:25] so I am wondering if it makes sense to wait for the new hw [12:24:42] aha [12:29:32] https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=db1047&refresh=1m&orgId=1&from=now-7d&to=now&panelId=19&fullscreen [12:29:37] basically this is what happens [12:29:46] lookin [12:29:51] i noticed that when disk saturation is 100% then everything slows down a lot [12:30:02] but for example, on analytics-store the baseline is already that high :D [12:30:31] elukey, I found this, no idea if makes sense though: https://2bits.com/articles/reduce-your-servers-resource-usage-moving-mysql-temporary-directory-ram-disk.html [12:32:02] there might be a lot of tuning to do on those mysql instances, Jaime/Manuel could probably help but I am trying to think if it makes sense to spend tons of time of that when the new hw will arrive in Q2 [12:32:36] Hello everyone, I'm having trouble with the entries in wmf.wdqs_extract on 23.08.2017,specifically the hours 9-17. Trying to access them results in "Hive internal error: conversion of string to arraynot supported yet." All other hours (and days) work fine. Can anyone help me out with this? [12:36:53] arnad, Hi! I think this is a problem that we had a couple weeks ago. It was fixed, but apparently wmf.wdqs_extract still has some corrupt values. I think joal knows if it's fixable and how to fix? Not sure if he's here now though. 
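Back on the eventlogging_cleaner thread above: a quick back-of-the-envelope sketch of the purge throughput being weighed here. The batch size and sleep come from the conversation (1000 rows, a few seconds of pause), the per-batch time is the ~10-15s elukey reports later in the afternoon on the old dbstore hardware, and the ~1B-row table size is the figure mentioned further down; the helper function itself is purely illustrative and not part of the real script.

```python
# Illustrative throughput math for a batched UPDATE/DELETE purge loop like
# eventlogging_cleaner.py; all numbers are the ones discussed in this conversation.

def purge_rate(batch_rows, batch_seconds, sleep_seconds):
    """Rows purged per day for a given batch size and per-batch timing."""
    cycle = batch_seconds + sleep_seconds          # one batch plus the pause after it
    rows_per_minute = batch_rows * 60 / cycle
    return rows_per_minute * 60 * 24               # rows per day

rows_per_day = purge_rate(batch_rows=1000, batch_seconds=12, sleep_seconds=4)
print(f"~{rows_per_day / 1e6:.1f}M rows/day")                 # ~5.4M rows/day
print(f"~{1e9 / rows_per_day:.0f} days for a 1B-row table")   # roughly half a year
```

At that rate a billion-row table takes around six months to work through, which is the arithmetic behind waiting for the new hardware instead of tuning the old hosts.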
[12:39:02] I'll ask him about it, thanks! [12:41:50] mforns: so I just stopped eventlogging_cleaner on db1047 - https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=db1047&refresh=1m&orgId=1&from=now-1h&to=now&panelId=19&fullscreen [12:42:03] but as you can see the background is still high [12:42:06] aha [12:45:14] now I am stopping eventlogging_sync [12:49:14] elukey, oh now it makes a huge difference [12:51:12] mforns: last one are the mysql slaves [12:52:08] show full process list is empty now [12:53:14] now why the hell mysqld takes 50% of the disk usage? [12:53:33] ah no it is dropping heavily [12:54:14] ok [12:54:40] but as you can see eventlogging_cleaner is not the worst offender [12:55:01] aha, looks like the sync script is taking a lot of disk [12:55:38] last one is mysql replication of the wikis [13:04:52] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3596390 (10elukey) {F9456503} The above graphs shows more or less the disk saturation on db1047 (analytics-slave) caused by: * eventlogging_cleaner.py... [13:04:56] mforns: -- [13:04:58] --^ [13:05:10] reading [13:06:11] yea, mmm [13:07:03] 10Quarry: The star icon will not show/load in Quarry's star button - https://phabricator.wikimedia.org/T175466#3596392 (10zhuyifei1999) IIRC, you can't star your own query... [13:14:40] this could also be a tuning issue with tokudb [13:29:55] mforns: with batch sizes of 1000 and 4s of delay on dbstore1002 seems going "fine"ish, namely a batch takes ~10/15s to complete [13:30:08] that is too much, because it means 4k rows / minute [13:30:17] but better than nothing [13:30:52] 240 rows / hour [13:31:18] err sorry, 240k [13:31:52] ~5.5M per day [13:32:04] never gonna make it against the 1B tables :D [13:38:08] * elukey cries [13:38:13] xDDDD [13:39:17] * elukey executes drop table [13:39:50] ??? [13:41:05] kidding [13:41:17] xDDDD [13:48:01] ok I started the script on both slaves with batch sizes 1000 and sleep 4s [13:48:53] ok, cool! [13:49:12] hi a-team! [13:49:19] I'm back! [13:49:23] also: what is programming? [13:50:00] milimetric!! :D [13:50:22] milimetric: o/ o/ o/ [13:50:25] Hi milimetric !!!!! So happy to see you :) [13:50:39] heya, I'm gonna catch up on some email but lemme know if there's anything urgent or if you just wanna hang out [13:51:04] likewise, very happy to be back actually [13:51:20] I thought I'd be more torn, but it's nice being at home where baby's just next door if I wanna visit [13:53:16] 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Improve purging for analytics-slave data on Eventlogging - https://phabricator.wikimedia.org/T156933#3596782 (10elukey) Started eventlogging_cleaner.py again on both slaves, with batch-size 1000 and sleep 4s. This seems to work in a reasonable amount of... [13:55:40] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3596789 (10elukey) a:05RobH>03Cmjohnson Assigning back to Chris as discussed on IRC: we'd need to move the Kafka Jumbo hosts out... [13:56:15] milimetric, yes, hangout! [13:56:37] ok, 5 minutes? [13:56:55] milimetric, yep [14:08:21] (03CR) 10Elukey: "Looks good so far, but I have a question: do exception stack traces fall into the WARNING/ERROR/etc.. logging that goes to stderr? 
(just t" (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/355601 (https://phabricator.wikimedia.org/T162034) (owner: 10Mforns) [14:12:50] ottomata: o/ [14:18:46] hiiii [14:20:12] heya! [14:21:54] ottomata: just pinged Chris and he is going to assing new ips to kafka-jumbo, and then we'll need to reimage [14:22:13] so atm the new cluster is not usable :( [14:22:27] yeahhhhh, saw that :( [14:22:55] elukey: not so sure about this coodinator name :) [14:24:08] ottomata: well we can call it whatever we want at this stage, I wanted to move things away from site.pp and reduce the hiera config.. I like coordinator, it is what an1003 does more or less, buuuut I'll be ok with whatever name [14:24:36] my goal is to have eventually only one role per host in site.pp [14:24:46] and then work on profiles etc.. [14:25:54] ya like the idea, just not sure about the name [14:26:02] its not really 'coordinating', its just a bunch of stuff on one node [14:26:37] orchestrator? [14:26:51] naw, that makes it sound like it has some official role [14:27:06] (this is one reasoni don't love the one role per node thing, makes you make up names) [14:27:08] :) [14:27:11] hahahaha [14:27:41] I always pictured an1003 as the central node for hive/oozie/etc.. so this is why I found coordinator natural [14:28:03] it drops data when needed, etc.. [14:28:28] just feels weird though, an1003 is not special, it was just convenient to put those things all there [14:29:11] not really sure what is better though [14:29:14] that's a tough one [14:31:20] it is a black hole: all the cluster gravitates around it and whenever it is down it absorbs most of the jobs in its event horizon :D [14:31:51] hi guys [14:32:24] hi dsaez, what's up [14:33:45] good, fighting with pyspark on the stat1005... do you know whom can i ask about it? [14:34:17] I think there something misconfigured [14:34:22] dsaez: we don't have a lot of experience with pyspark, but folks on the discovery team do [14:34:29] ebernhardson: probably [14:34:31] but you can ask here too! [14:34:37] dsaez: You shouldn't use spark on stat1005 - It has java 8, incompatible with our cluster [14:34:38] maybe we can help anyway! [14:34:42] dsaez: Please use stat1004 [14:34:51] ok! going there! [14:36:19] same problem, so basically, I'm trying to load an external package. I used to do this in other installations without problems, going like this: pyspark --packages graphframes:graphframes:0.1.0-spark1.6 [14:36:47] however, where it gets stoped: [14:36:52] graphframes#graphframes added as a dependency [14:36:52] :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 [14:36:53] confs: [default] [14:37:05] dsaez: I have not been to load external packages, given our machines are not connected directly to the qweb [14:37:15] dsaez: If you do though, I'm interested ! [14:37:38] yeah, that is probably not gonna work. you'll have to dl the dependencies and add them in some other way [14:37:52] can't dl from internet via pip or maven whatever [14:38:29] dsaez: this might help [14:38:29] https://wikitech.wikimedia.org/wiki/HTTP_proxy [14:38:31] but not a lot :) [14:39:24] ottomata: I've done this for using venv and pip ..works, but not for spark, I guess this is because this env. variables doesn't go to the master [14:39:40] (03CR) 10Fdans: [C: 031] "Looks good to me!" 
(031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: 10Joal) [14:39:54] joal: I find that package really useful [14:40:14] I can try to download, and install localy [14:40:21] dsaez: Wich one, graphframes? [14:40:27] yes [14:40:49] it's kind of wrapper for graphX ...but a lot easier to use [14:42:29] dsaez: I think I have solution for you then, you can use "--jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar" --> this jar files has GraphFrames 0.3 embeded [14:43:34] joal: , he's using pyspark though, no? [14:43:44] ottomata: Arf :( [14:43:51] hm, actually, I don't know ! [14:44:37] should be ok from what I read ... Weird [14:45:12] pyspark accepts the jars, but: [14:45:13] >>> from graphframes import * [14:45:13] Traceback (most recent call last): [14:45:25] ImportError: No module named graphframes [14:46:33] ottomata: did you see https://gerrit.wikimedia.org/r/#/c/376663/ ? I am trying to rebuild the jmxtrans pkg and master seems ok afaics, have you ever built it ? (maybe for stretch) [14:47:22] saw but havn't looked yet, looking now! [14:47:41] i have built it, been a while, i think nick (before your time) tweaked it a bit [14:48:14] hola milimetric ! [14:48:21] hi! :) [14:49:12] dsaez: https://github.com/graphframes/graphframes/issues/172 [14:49:53] joal: looks good, will try, thanks [14:50:14] np dsaez, sorry that it's not a bit more straighforward ! [14:50:34] dsaez: Do you mind giving me a hint on what kind of graph analysis you're doing?n [14:52:16] I'm trying to cluster wikipedia Categories, by creating a graph of co-ocurrences, and then I want to run a community detection algorithm. The idea is that categories in the same community/clique/cluster will be semanticaly related [14:53:40] so, I have a graph with around 40K nodes, and ... (i don't remember how many edges) [14:53:44] dsaez: ok! Nice ! Form category page I assume? [14:53:57] category table table I mean, sorry dsaez [14:55:48] no, I go to each article, extract the categories from there. So If one article is Jazz, Artist, Music , I'll create these edges: jazz-music, jazz-artist, artist-music [14:56:09] dsaez: From text? [14:56:45] a collaborator gave this list, I think he got from the sql dumps. [14:57:02] Ah, ok :) [14:57:27] elukey: i don't see master continaing debian dir [14:57:49] HMM [14:57:50] but it does [14:57:51] Hmm [14:57:52] wha [14:57:53] dsaez: I assume it comes from a SQL table, I'm interested to know more (when you know more, obviously), and also interested in the results you get from analysis :) [14:58:16] ah, i just pulled from github remote [14:58:19] upstream [14:58:27] elukey: do you want me to try and build this? [14:58:30] with your change? [14:58:31] for stretch? :) [14:59:17] ottomata: I had all your reactions when checking out the repo branch :D [14:59:20] haha [14:59:58] elukey: i am not totally sure of the status of this repo [15:00:06] we def aren't synced with upstream master [15:00:19] since it has so many changes...and there was something that kept us from uprading [15:00:24] that might be fixed now... 
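For context on the graphframes ImportError above (14:45): --jars only puts the GraphFrames classes on the JVM classpath, while `from graphframes import *` also needs the Python package on sys.path. A commonly suggested workaround, and roughly what the GitHub issue ottomata links discusses, is to ship the GraphFrames spark-packages artifact as a Python dependency too, since that jar also carries the .py sources. A hedged sketch, with the artifact path made up:

```python
# Sketch of the py-file workaround for "ImportError: No module named graphframes".
# The jar path below is illustrative; the refinery-job.jar suggested earlier bundles
# only the JVM classes, so the spark-packages graphframes artifact is what gets shipped here.
from pyspark import SparkContext

sc = SparkContext(appName="graphframes-import-sketch")

# addPyFile puts the archive on the driver's sys.path and distributes it to executors;
# a jar is a zip, so Python can import the bundled graphframes package from it.
sc.addPyFile("/path/to/graphframes-0.3.0-spark2.0-s_2.11.jar")

from graphframes import GraphFrame  # now importable (the JVM side still needs --jars or --packages)
```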
[15:00:58] ping fdans ottomata [15:01:12] 10Analytics, 10Analytics-Wikistats: Line graph-related Wikistats changes - https://phabricator.wikimedia.org/T175582#3597059 (10fdans) [15:01:16] ping mforns [15:01:23] yep [15:01:25] (03CR) 10Joal: Add mediawiki history edits metrics endpoint (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: 10Joal) [15:04:56] joal: will tell you. Here you have few details: https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages/Ranking_sections_within_categories [15:06:20] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10mobrovac) IMHO, `updateBetaFeaturesUserCounts` is the perfect candidate here. It's very lightweight (one `SELECT`... [15:07:27] another question: do you use any IRC bouncer? I was wondering if i can install one in some of our machines (quassel?). I've also read that is possible to get an ircloud acccount [15:11:42] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Audit users and account expiry dates for stat boxes - https://phabricator.wikimedia.org/T170878#3597109 (10elukey) a:05Ottomata>03elukey [15:12:37] dsaez: I don't use any irc bouncer no, I can't help on that [15:21:22] 10Analytics: Pagecounts-ez not generating - https://phabricator.wikimedia.org/T172032#3483493 (10Milimetric) Looks like this is resolved, at least files were written after the date of the bug, @Erik_Zachte anything you need help with before closing the task? [15:37:34] dsaez: i also have heard you can get an IRC cloud account, not sure who to ask though [15:37:40] maybe ask in #wikimedia-operations [15:39:18] i think oit hands out the licenses for those [15:39:22] ottomata, thx [15:40:09] I think I'll try first with some free solution [15:40:18] https://firrre.com/ also provides a bouncer, but they've stopped accepting new users átm [15:46:26] ping joal [15:57:43] dsaez: ever figure out the pyspark stuff? looks like maybe [15:59:48] (03PS1) 10Nettrom: Add page creation dashboard configuration [analytics/dashiki] - 10https://gerrit.wikimedia.org/r/377290 [16:01:16] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588533 (10GWicke) Yay! 🎆 [16:06:49] ebernhardson: not yet [16:07:26] 10Analytics, 10Analytics-EventLogging, 10Contributors-Analysis, 10EventBus, and 3 others: Visualize page create events for all wikis - https://phabricator.wikimedia.org/T170850#3597397 (10Nettrom) @Nuria : I added a short note to [[ https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Dashboards#Test_lo... [16:09:11] dsaez: you're trying to pull in java dependencies, or python? I do both but perhaps not in a normal way [16:10:09] dsaez: for java dependencies i build a 'batteries included' jar that has all the dependencies, and provide it on the command line. For python dependencies i use pip on stat1005 with the HTTPS_PROXY set, then zip that up and provide to --archive so it's distributed to the workers [16:10:34] joal recommended to use stat1004 [16:10:55] i find stat1005 better because the version of pip there can use wheel packages in python and doesn't try to compile the owrld [16:11:09] ebernhardson: no problem due to different java versions? 
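A minimal sketch of the venv-shipping approach ebernhardson describes above: build the virtualenv with pip behind the web proxy, zip it, and hand it to YARN as an archive so each executor unpacks it and uses its interpreter. The archive name and config-key spelling below are assumptions (they express the same thing as the --archives / PYSPARK_PYTHON command line he pastes a bit further down), and as he notes, the venv has to be built with a Python that matches the workers.

```python
# Sketch only. Assumes a venv built beforehand on a matching Python, e.g.:
#   https_proxy=http://webproxy.eqiad.wmnet:8080 venv/bin/pip install <your-packages>
#   cd venv && zip -qr ../my_venv.zip . && cd ..
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("venv-shipping-sketch")
    # YARN unpacks my_venv.zip into each executor's working directory as ./venv
    .config("spark.yarn.dist.archives", "my_venv.zip#venv")
    # Spark 2.1+ knob for pointing PySpark workers at the shipped interpreter
    .config("spark.pyspark.python", "venv/bin/python")
    .getOrCreate()
)
```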
[16:11:17] joal: not yet [16:11:19] joal or ottomata: Have you run spark jobs against the webrequest log before? [16:11:32] ebernhardson, dsaez : No prob for me then :) [16:11:38] Shilad: We have [16:12:31] joal: I have an outline for my job and have been testing it. I'm a little worried about performance. The RDD join between webrequests and page ids is a little concerning. [16:12:33] joal: actually 1004 was more of a problem, because the python versions were different. If i built a virtualenv on 1004 and shipped it to the executors they couldn't run the python executable [16:12:44] i had to build the python venv on a local virtual machine and ship that over [16:13:00] sounds good ebernhardson :) [16:13:00] I'm trying now from stat1005 [16:13:49] Shilad: It is - a simple way, for testing, is to extract / copy a small amount of data and run against that [16:14:15] joal: I've been testing against simple English, with a few hours of views. [16:14:48] joal: How long do the clickstream jobs typically take? [16:14:53] this solution didn't work: https://github.com/graphframes/graphframes/issues/172 [16:15:28] doing anything with webrequests is time consuming, i 've had my best luck setting up an oozie job that pre-processes webrequests into the data i need on an hourly basis, then running my jobs againts that pre-processed data. If you're doing one time analysis though thats obviously overkill [16:16:10] yep :( [16:16:22] ebernhardson: My hope is this becomes a regularly-run job. [16:16:52] Shilad: I'm not afraid of enwiki-page, however 1 hour of webrequest is already 25G [16:16:52] Ok, i've installed pyspark using pip, but I think installed for spark 2.2, and the spark version on the cluster is 1.6 [16:17:37] dsaez: i use 2.1.0, although i didn't install it via spark i just grabbed the download from spark's downloads area for hadoop 2.6 [16:17:46] s/via spark/via pip/ [16:18:03] joal: The way I've done this in the past is to create a compressed representation of session activity that lags a few days behind. That way, you only need to process one day's worth of views per day. [16:18:12] Does that seem reasonable? [16:18:13] dsaez: basically it just means spark will upload all the jars into a temp area on hdfs every time you run a job [16:18:26] ebernhardson: that means to have a local installation of spark? [16:19:10] Shilad: Our jobs work hourly on regular basis - We can have a daily one - Also, given you probably are mostly interested [16:19:17] dsaez: yes i have a local install of spark basically on my home dir in stat1005, and when i run `~/spark-2.1.0-bin-hadoop2.6/bin/spark-submit ...` it copies all the appropriate stuff into a temp dir on hdfs which lets the executors find it all [16:19:59] ebernhardson: so, you just use one machine? 
[16:20:03] in pageviews Shilad, there will be a table having full webrequests fields and only pagviews soon [16:20:11] dsaez: no, i use the full cluster [16:20:39] wow [16:20:50] I didn't know that this was possible, sounds as good option [16:21:03] a full commandline for my thing ends up looking something like: PYSPARK_PYTHON=venv/bin/python SPARK_CONF_DIR=/etc/spark/conf ~/spark-2.1.0-bin-hadoop2.6/bin/spark-submit --jars /home/ebernhardson/mjolnir-0.1-jar-with-dependencies.jar --driver-class-path /home/ebernhardson/mjolnir-0.1-jar-with-dependencies.jar --master yarn --files /usr/lib/libhdfs.so.0.0.0 --archives [16:21:09] 'mjolnir_venv.zip#venv' venv/lib/python2.7/site-packages/mjolnir/cli/data_pipeline.py -i 'hdfs://analytics-hadoop/wmf/data/discovery/query_clicks/daily/year=*/month=*/day=*' -o 'hdfs://analytics-hadoop/user/ebernhardson/mjolnir/20170908_cswiki' -q 1000000 -c codfw cswiki [16:21:47] and that ships all the spark jars into hdfs, along with the custom jar and the mjolnir_venv.zip, the executors launched by yarn load all that stuff [16:22:36] the driver i suppose does stay running on stat1005, but the executors are in the cluster [16:24:07] joal: Can you tell me more about this table having full webrequest fields and pageviews? Are you saying that there will be a table that has basic web request info + page ids? [16:24:11] ebernhardson: sounds good, [16:24:42] ebernhardson: mjolnir-0.1-jar-with-dependencies.jar, whould be new package? [16:25:16] Shilad: We have a project to split webrequest into smaller tables based on tags (pageviews being one of them) - Those tables would be copies of the web [16:25:23] dsaez: yes thats a custom jar i build with maven that includes some dependencies i needed (xgboost, kafka, and spark-streaming-kafka) and a tiny bit of clue code that made calling it from python easier [16:25:36] I see [16:25:36] webrequest one in term of fields, but with less rows, therefore easier to work with [16:26:26] sounds good, I've still don't undestand why the webrequest doesn't work, considering that I've already setup the proxy [16:26:49] dsaez: trying to make web requests from the executors? [16:27:37] hmm... good point, I'm not sure how this --packages arg. works in the executors [16:29:34] dsaez: ahh, i havn't tried to load java dependencies that way before. To avoid going out to the internet can use archiva.wikimedia.org as the repository but need to get the packages uploaded there too i suppose [16:29:55] annoyingly we never managed to get ldap setup for archiva, so there is just a shared password somehwere.. [16:30:54] solved! [16:30:58] pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080" [16:31:04] yeah! [16:31:06] :)_ [16:31:20] joal: Those pruned down tables sound useful! I also think I'm going to test the performance of rewriting the job so that it broadcasts a data structure mapping page titles -> page ids to avoid the join. I have an intuition that the join is slowing things down. We will see. [16:31:46] Where should I documment this? might be helpful for other in the future [16:32:06] Shilad: feasible - I wonder about the size of such a structure :) [16:32:14] dsaez: probably https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark ? 
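A minimal sketch of the lookup Shilad proposes just above: broadcast the title-to-id map once and resolve ids with a UDF, so the big webrequest-derived side never has to shuffle for the join. Table and column names here are placeholders, not real Data Lake tables.

```python
# Broadcast-dictionary lookup instead of a shuffle join; all table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("broadcast-lookup-sketch").getOrCreate()

# Build the small side as a plain dict on the driver (~100MB for enwiki per the estimate below).
pages = spark.table("some_db.page").select("page_title", "page_id")
title_to_id = {r.page_title: r.page_id for r in pages.collect()}
bc = spark.sparkContext.broadcast(title_to_id)

lookup_page_id = udf(lambda title: bc.value.get(title), LongType())

views = spark.table("some_db.pageview_sample")        # placeholder for the webrequest side
views_with_ids = views.withColumn("page_id", lookup_page_id("page_title"))
```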
[16:33:04] I think I can use a 64 bit hash instead of page title, so 12 or 16 bytes per article -> about 100 MB for EN [16:33:10] Thanks ebernhardson for helping dsaez fixing :) [16:33:25] Shilad: Sounds good, even if big :) [16:34:02] yes! Thanks joal, and ebernhardson :=) [16:34:30] Shilad: I've not used broacast a lot, so I can't really say, but let's try :) [16:35:14] Shilad: It also depends on what data you need from other pages [16:35:49] joal: I think for now I only need page id, so it's pretty small. [16:36:02] Shilad: fwiw spark has a broadcast hash join that it uses if it thinks it's appropriate. You can explicitly mark a dataframe as small enough to broadcast for this: https://stackoverflow.com/a/37487575/5596181 [16:36:03] Shilad: Right [16:38:03] ebernhardson: Cool! I didn't know about that. Although an in-memory RDD is probably much bigger (?10X?) than an equivalent data structure with primitives using something like trove or debox? [16:38:40] Shilad: not sure, seems plausible though. the deserialized rdd is certainly going to have more boxing and whatnot [16:39:25] i'm not actually sure what spark is doing internally, if it keeps a serialized byte stream of the rows it might not be more memory. hard to say [16:39:45] nuria_: 1-1? [16:39:55] Like in 5 minutes nuria_ ? [16:40:11] let's reschedule as i have a meeting right after, we will do it this week though [16:40:34] k nuria_ - I'll work late, can be after your next meeting (depending on availability of course) [16:40:57] ok, rescheduled for later on today [16:41:25] thanks [16:42:48] joal or ebernhardson or whomever does Java / Scala Spark jobs with analytics-refinery: I'm wondering about using the shaded jars in analytics-job. In the past I've had problems with uber jars conflicting with Spark dependencies. Have you tried the uber jar out? Does it work okay? I've been using a stripped down version, but maybe it's not necessary. [16:42:56] 10Analytics: Procure hardware to refresh jupyter notebooks - https://phabricator.wikimedia.org/T175603#3597606 (10Nuria) [16:43:35] Shilad: hmm, i pull in refinery-hive.jar but i don't think thats an uber jar iirc [16:47:09] 10Analytics: Order hardware labs storage for mediawiki history - https://phabricator.wikimedia.org/T175604#3597629 (10Nuria) [16:47:26] ebernhardson, Shilad: refnery-hive is uber yes [16:47:26] 10Analytics: Order hardware labs storage for mediawiki history analytics friendly DB - https://phabricator.wikimedia.org/T175604#3597641 (10Nuria) [16:48:00] We've had issues with dependencies before, but still need some of deps to be here - I'm willing to test smaller jars shilad :) [16:49:43] so i guess perhaps i've just been lucky so far :) [16:50:05] oh of course its an uber jar ... because what i use from there is a lucene stemmer and lucene isn't pulled in any other way [16:55:04] ebernhardson: right [17:07:15] 10Analytics, 10User-Elukey: Move away from jmxtrans in favor of prometheus jmx_exporter - https://phabricator.wikimedia.org/T175344#3597697 (10elukey) [17:10:27] * elukey off! [17:41:49] nuria_: Hayhayhay ! I missed [17:41:59] nuria_: feeling better ;) [17:54:16] 10Analytics-Kanban: vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#3597892 (10Nuria) Ping @ezachte hello? [18:43:45] joal: do you know how I can make spark write gzip? [18:43:51] i keep trying stuff, not working [18:44:53] hmm, in spark 2 maybe, will try. [18:49:25] ottomata: cant just set the compression codec? 
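The join-hint flavour of the same idea, per the Stack Overflow answer ebernhardson links above (16:36): keep both sides as DataFrames and mark the page table as small enough to broadcast, so Spark plans a broadcast hash join instead of a shuffle. Table names are again placeholders.

```python
# Explicit broadcast join hint; table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

pages = spark.table("some_db.page").select("page_title", "page_id")
views = spark.table("some_db.pageview_sample")

# broadcast() is only a hint; without it Spark auto-broadcasts tables smaller than
# spark.sql.autoBroadcastJoinThreshold (10MB by default).
views_with_ids = views.join(broadcast(pages), on="page_title", how="left")
```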
[18:49:48] (maybe thats only spark 2, i havn't used 1.x in awhile) [18:51:17] yeah, am trying that, ithink it might work in 2 [18:51:56] .option("codec", "gzip") [18:51:57] will do it [18:52:06] looks liAHH [18:52:07] ahh [18:52:08] or maybe not? [18:52:09] hmm [18:52:16] should work... [18:52:16] grr [18:53:30] 10Analytics-Kanban: vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#3598243 (10DarTar) Hey @Nuria I spoke to Erik last Friday and he told me he was going to pick this up this week. I'll drop him an email if for whatever reason he missed these pings. [18:54:08] AH [18:54:11] there it is [18:54:18] .option("compression", "gzip") [18:55:35] huh, i would have expected "codec". Well at least you found it :) [18:57:28] haha, but now snappy doesn't work in 2! [19:09:38] 10Analytics-Kanban: vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#3598299 (10Nuria) Excellent, let's sync up on this once you are back, Erik. [19:11:49] Hey ottomata [19:11:57] Have you found what is needed? [19:12:47] ya i think so! [19:12:53] ok cool [19:12:54] in spark 2, i can do .option("compression" [19:12:55] seems to wokr [19:14:17] k [19:14:30] ottomata: are working with dataframes? [19:15:15] ya [19:15:18] read from parquet [19:16:13] ottomata: Yeah, I hit that before - I think the only way is to convert to RDD and write as textfile while setting codecv (for spark 1.6) - I'm glad it works with 2 :) [19:17:01] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3598347 (10GWicke) >>! In T175210#3597099, @mobrovac wrote: > IMHO, `updateBetaFeaturesUserCounts` is the perfect candidate... [19:19:31] joal: interesting finding! [19:19:38] for 1 hour of webrequest data [19:19:44] json gzip gets the best compression [19:19:52] well, parquet gzip beets it [19:19:54] beats it* [19:20:04] (but not really considering parquet for external export) [19:20:07] but, better than avro! [19:20:18] ottomata: Ah! Fun :) [19:20:24] hm, avro isn't using gzip. but deflate [19:20:27] so not same alg. [19:20:27] hm [19:20:31] i can't seem to make it use gzip? [19:20:32] hm [19:20:33] Ah [19:20:47] or can i... [19:20:59] https://issues.apache.org/jira/browse/AVRO-1243 - looks like you should :) [19:22:20] ya but maybe not in spark-avro [19:22:20] The supported types are uncompressed, snappy, and deflate [19:22:56] mwarf [19:22:58] but gzip and deflate are pretty much the same i think [19:23:19] I'm not good in codecs [19:23:22] Can' say [19:23:42] "GZip is simply deflate plus a checksum and header/footer" [19:23:45] i'm just googlin [19:23:56] k [19:24:02] So, should be the samw, hu [19:24:06] about anyway [19:24:14] i think for ease for everyone, gonna go with json [19:24:24] sounds good ottomata [19:24:51] For reusability, given good compression ratios, json seems the best [19:37:37] 10Analytics, 10Analytics-Cluster, 10Operations, 10ops-eqiad, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3598393 (10Cmjohnson) a:05Cmjohnson>03elukey @elukey updated dns entries and swich ports to reflect vlan-private1-row-eqiad [19:41:57] Done for today a-team - see you tomorrow [19:42:02] o/ [19:42:03] byeeeee [20:35:01] 10Analytics-Kanban: vet metrics calculated from the data lake - https://phabricator.wikimedia.org/T153923#2895122 (10Erik_Zachte) @Nuria Ah I missed these. 
Sorry about that. A good example of why restructuring my mailbox was dearly needed, so I finally fine-tuned Gmail filters this weekend. So I could start vett... [20:54:14] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10ops-eqiad, 10User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3598540 (10Cmjohnson) @elukey we finished these...correct? [23:47:33] 10Analytics, 10Scoring-platform-team: Grafana has confusing or wrong scale for "scores errored" graph - https://phabricator.wikimedia.org/T175651#3599015 (10awight) [23:48:58] 10Analytics, 10Scoring-platform-team: Grafana has confusing or wrong scale for "scores errored" graph - https://phabricator.wikimedia.org/T175651#3599004 (10awight)
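Closing out the compression thread from earlier in the evening (18:43 to 19:24): in Spark 2 the output codec for a DataFrame writer is chosen with the `compression` option, which is what ottomata lands on after trying `codec`. A small sketch with made-up paths:

```python
# Spark 2.x writer compression, per the discussion above ("compression", not "codec").
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-sketch").getOrCreate()

df = spark.read.parquet("/path/to/some/parquet")        # input path is illustrative

# gzip-compressed JSON, the format/codec combination picked for external export above
df.write.option("compression", "gzip").json("/tmp/output_json_gz")

# parquet accepts gzip too (snappy is its default)
df.write.option("compression", "gzip").parquet("/tmp/output_parquet_gz")
```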