[05:51:30] o/
[06:36:55] so aqs looks good afaics
[06:39:30] checked also cassandra metrics for aqs1004, nothing weird found so far
[06:39:42] so in ~1h I'd proceed with aqs1005 too
[06:44:07] it would be great to have it reimaged by the end of the week
[06:49:08] +1 elukey :)
[07:14:48] ah joal I forgot to mention that we are going to get 10G NICs probably with the new hadoop worker nodes :)
[07:16:20] (brb for a bit!)
[07:38:04] back
[07:38:09] going to reimage aqs1005!
[08:00:34] 10Analytics, 10Patch-For-Review, 10User-Elukey: Upgrade AQS to Debian Stretch - https://phabricator.wikimedia.org/T196138 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['aqs1005.eqiad.wmnet'] ``` and were **ALL** successful.
[08:10:41] aqs1005 reimaged, looks good
[08:41:15] (03CR) 10Sahil505: "I think we can merge this change now and take care of the chart (bar & line) bug in the new one :-]" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: 10Sahil505)
[08:54:58] 10Analytics, 10Operations, 10hardware-requests: eqiad: (2) hardware refresh for analytics1003 - https://phabricator.wikimedia.org/T198685 (10elukey)
[09:03:53] joal: everything looks stable, do you mind if I reimage aqs1006 as well ?
[09:11:08] 10Analytics, 10Patch-For-Review, 10User-Elukey: Upgrade AQS to Debian Stretch - https://phabricator.wikimedia.org/T196138 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['aqs1006.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimag...
[09:11:24] (doing it now)
[09:29:46] 10Analytics, 10Patch-For-Review, 10User-Elukey: Upgrade AQS to Debian Stretch - https://phabricator.wikimedia.org/T196138 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['aqs1006.eqiad.wmnet'] ``` and were **ALL** successful.
[09:32:46] aqs1006 reimaged :)
[09:33:02] since the process is so smooth, I think I'll be done tomorrow
[09:50:58] (03PS1) 10Fdans: Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582
[09:51:06] (03CR) 10jerkins-bot: [V: 04-1] Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582 (owner: 10Fdans)
[09:51:44] (03PS2) 10Fdans: Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582 (https://phabricator.wikimedia.org/T188928)
[09:51:51] (03CR) 10jerkins-bot: [V: 04-1] Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582 (https://phabricator.wikimedia.org/T188928) (owner: 10Fdans)
[09:58:24] (03PS3) 10Fdans: Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582 (https://phabricator.wikimedia.org/T188928)
[10:04:08] (03PS4) 10Fdans: Changes map component to accept concrete numbers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443582 (https://phabricator.wikimedia.org/T188928)
[10:07:48] (03CR) 10Mforns: [V: 032 C: 031] "Yea, I agree. But let's merge this change only when the other task (bug) is finished.
This way, we do not block deployments (in the unlike" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: 10Sahil505)
[10:10:21] 10Analytics, 10Operations, 10hardware-requests: eqiad: (1) new stat box to offload users from stat1005 - https://phabricator.wikimedia.org/T196345 (10faidon) The argument that switches between stat boxes are expensive in staff time, so we should make them less often doesn't resonate much with me (maybe we sh...
[10:11:22] 10Analytics: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (10elukey) p:05Triage>03Normal
[10:18:29] 10Analytics: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (10elukey)
[10:18:41] all hw requests listed in --^
[10:18:45] so we can track their status
[10:32:37] 10Analytics: Order Data Lake Hardware - https://phabricator.wikimedia.org/T198424 (10elukey) Hardware request task for the Druid Analytics cluster: https://phabricator.wikimedia.org/T166510 3 nodes with 64G of RAM and four Intel 1.6T SSD disks. We are currently setting a 2.9T raid10 lvm ext4 partition on each n...
[10:32:56] joal: --^ (whenever you have time to add comments)
[10:41:56] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Partially purge MobileWikiAppiOSUserHistory eventlogging schema - https://phabricator.wikimedia.org/T195269 (10chelsyx) @mforns > I think the graph in T130432 has changed since then. I executed the provided query and on 2017-04-20 th...
[10:57:32] 10Analytics, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578 (10fdans) Ping @Tbayer let's mark this as resolved?
[11:24:04] (03PS1) 10Sahil505: Updated the functioning & styling of filter & split switches [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443592 (https://phabricator.wikimedia.org/T198183)
[11:39:44] reimaging aqs1007 now!
[12:19:08] Hi elukey! You're unstoppable with aqs reimaging :)
[12:19:54] \o/
[12:21:35] elukey: I'm not sure how you'd like me to comment on the presto RAM thing :)
[12:21:40] I'm gonna look for documentation :)
[12:23:44] if it makes sense or not, your thoughts etc..
[12:27:21] joal: aqs1007 repooled!
[12:27:26] \o/
[12:27:26] only two to go
[12:29:16] elukey: scanning stuff on presto - Interesting to understand parquet: https://eng.uber.com/presto/
[12:35:40] "we had over two thousand people running more than one hundred thousand analytic queries daily"
[12:35:54] elukey: small stuff ;)
[12:38:07] joal: one qs - will the data lake in labs only run presto without any hadoop/hdfs/etc.. related thing?
[12:38:18] elukey: I assume so yes
[12:38:31] elukey: My understanding is that the data will be available on every node
[12:39:22] elukey: The more I read, the more I think presto uses RAM for intermediate joins/aggregation, but reads original data at every query
[12:39:59] I'm assuming therefore that system-cache is important in our use case (we could also test the in-memory-connector)
[12:40:36] there is a memory connector (https://prestodb.io/docs/current/connector/memory.html) but I think it is not really great for our use case
[12:40:50] and yes, page cache will surely help!
[12:41:26] elukey: memory-connector seems to be oriented toward small/medium dataset sizes
[12:41:59] also elukey - Everything I read is about collocating presto and hadoop workers
[12:42:24] For internal use cases, could be interesting?
[12:42:49] it could indeed
[12:43:40] the main question mark that I see now is: do we know how presto behaves/performs on the local file system, and what are the trade-offs ?
[12:44:35] elukey: read-throughput over many nodes is obviously a huge win
[12:44:57] sure but how is data laid out across nodes?
[12:45:01] same data in all of them?
[12:45:07] yessir
[12:45:57] I would set up a test presto cluster in labs first
[12:46:08] play with it and figure out what we need
[12:46:17] atm I feel that we are only making suppositions
[12:46:25] am I wrong?
[12:46:46] elukey: looks like presto doesn't have a connector for local files (except some specific http-request logs)
[12:46:57] Absolutely not - Let's play in labs
[12:47:03] +1
[12:47:37] because if we discover that hdfs is "practically" needed, then it might require us to think about the hw that we need
[12:48:57] elukey: indeed !!!
[12:49:20] elukey: thinking about the hardware is an unquestionable requirement :)
[12:50:31] ack :)
[12:54:07] need to go afk for ~1h for an errand, will read later on!
[12:54:10] * elukey afk for ~1h
[12:54:26] ack
[12:57:58] o/ :)
[13:23:18] heyaaaa :]
[13:58:18] hmm, anyone know if there is a way to configure python http responses to assume they are utf-8 and decode them?
[13:58:46] i'm getting a strange error deep in some library stuff: failed with exception: - 'the JSON object must be str, not 'bytes''
[13:58:57] i can tell this is happening from https://github.com/toidi/hadoop-yarn-api-python-client/blob/master/yarn_api_client/base.py#L19
[13:59:11] if i modify that code and .decode('utf-8') it's fine
[13:59:18] but there are also other places around that library where that happens
[14:05:04] 10Analytics, 10EventBus, 10Services (done): Create schema evolution tests for event-schemas - https://phabricator.wikimedia.org/T133419 (10Pchelolo) 05Open>03Resolved The tests have been created a long time ago. Resolving.
[14:11:13] 10Analytics, 10EventBus, 10monitoring, 10Services (done): Clean up retry-retry Kafka topics - https://phabricator.wikimedia.org/T179958 (10Pchelolo) 05Open>03Resolved This has been done both for Kafka topics and monitoring.
[14:23:32] * elukey back
[14:23:43] (took a bit more than expected)
[14:24:07] ottomata: o/
[14:54:55] (03PS1) 10Mforns: Make bar-chart and line-chart resilient to breakdowns with null values [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630)
[14:58:49] (03CR) 10Mforns: [C: 04-1] "Still WIP, we have to fix some bar-width problems originated by the same undefined-values concept." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/443640 (https://phabricator.wikimedia.org/T198630) (owner: 10Mforns)
[15:01:21] having trouble joining the batcave
[16:07:27] Hi a team! I remember talking to someone (possibly milimetric) about looking at the number of clicks on links on wikipages? was I dreaming? is there an easy-to-query data set for that somewhere?
[16:08:20] addshore: you can use pagelinks in combination with webrequest and the referer field to get answers but there's nothing pre-generated
[16:09:05] ack
[16:10:18] addshore: also baha and leila just launched a new schema that I think tracks clicks on reference links, but that sounds different from what you need
[16:10:33] addshore: just in case you don't know, wmf_raw.mediawiki_pagelinks is sqooped monthly
[16:10:51] thanks!
[16:12:01] addshore: might you be referring to clickstream?
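A note on the Presto-in-labs idea discussed above: once a test coordinator is up, it can be poked at from Python with any Presto client. The sketch below uses pyhive, which is not mentioned in the conversation and is only one option; the host, catalog, schema and table names are placeholders.

```python
# Hypothetical smoke test against a labs Presto coordinator; pyhive is just
# one possible client, and every name below is a placeholder.
from pyhive import presto

conn = presto.connect(host='presto-test.example', port=8080,
                      catalog='hive', schema='default')
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM some_test_table')
print(cur.fetchall())
```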
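On the "JSON object must be str, not 'bytes'" error ottomata hits above: on Python 3.5 and earlier, json.loads() only accepts str, while HTTP response bodies are read as bytes, hence the .decode('utf-8') workaround. A minimal sketch of the same pattern against the YARN ResourceManager REST API, using plain urllib rather than yarn_api_client's internals (the host below is a placeholder):

```python
# Minimal sketch of the decode-before-parse workaround: json.loads() on
# Python <= 3.5 rejects bytes with "the JSON object must be str, not 'bytes'",
# so decode the body first (assuming the API serves UTF-8).
import json
from urllib.request import urlopen

def get_json(url):
    with urlopen(url) as response:
        body = response.read()                  # bytes
        return json.loads(body.decode('utf-8'))

apps = get_json('http://resourcemanager.example:8088/ws/v1/cluster/apps')
print(apps)
```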
[16:12:20] joal: perhaps, I couldn't find much on wikitech about clickstream though
[16:12:26] addshore: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
[16:12:47] aaah, because it is not on wikitech!
[16:13:13] addshore: https://dumps.wikimedia.org/other/clickstream/readme.html
[16:13:57] I also see a clickstream db in hadoop
[16:15:38] addshore: this database is the old one Ellery used
[16:15:59] I always forget about this dataset, thanks joal
[16:17:05] addshore: newly computed data is accessible on HDFS here: /wmf/data/archive/clickstream
[16:17:25] np milimetric :)
[16:18:25] but not directly available as a table / db in hive? :)
[16:18:57] nope addshore
[16:19:09] We could (should?) have done that
[16:22:33] * addshore still doesn't know how to query other data in HDFS ;)
[16:28:57] addshore get into some spark!
[16:33:54] * elukey off!
[16:34:07] addshore: I second ottomata here :)
[16:43:14] (03CR) 10Sahil505: [C: 04-1] "WIP" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/437387 (https://phabricator.wikimedia.org/T190915) (owner: 10Sahil505)
[17:06:46] ottomata: joal I'll have to ;)
[17:44:33] addshore: I have tricks for you if you wish )
[18:19:24] joal, yt? :] have you done spark-scala profiling before?
[18:19:47] Hi mforns
[18:20:01] Not very deeply no
[18:20:40] joal, I'm looking into this: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector/
[18:20:45] https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
[18:21:52] mforns: I know we could also go for https://github.com/criteo/babar
[18:22:01] lookin
[18:22:26] that looks better
[18:23:00] mforns: More "integrated" I think
[18:23:05] joal, suddenly the EL Sanitization is a LOT slower
[18:23:09] mforns: But more work to put into place
[18:23:18] hm mforns
[18:23:23] batcave for a moment?
[18:23:27] and I think it's since the logs have the geocoded_data field populated
[18:23:28] sure
[18:28:59] 10Analytics, 10Product-Analytics: Upgrade SWAP's JupyterLab from - https://phabricator.wikimedia.org/T198738 (10Ottomata) Since the dependencies are installed in each user's personal virtualenv, you can actually do this yourself in the Jupyter CLI. Launch a terminal in Jupyter and then do ``` pip install --u...
[18:32:16] (03PS1) 10Ottomata: Updating wheels with Apache toree 0.2.0 rc5, and jupyterlab 0.32.1 [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443669 (https://phabricator.wikimedia.org/T198738)
[18:33:39] (03CR) 10Ottomata: [V: 032 C: 032] Updating wheels with Apache toree 0.2.0 rc5, and jupyterlab 0.32.1 [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443669 (https://phabricator.wikimedia.org/T198738) (owner: 10Ottomata)
[18:46:15] (03PS1) 10Ottomata: Use locally committed toree .tar.gz to build wheel [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443672
[18:47:06] (03CR) 10Ottomata: [V: 032 C: 032] Use locally committed toree .tar.gz to build wheel [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443672 (owner: 10Ottomata)
[19:41:10] 10Analytics, 10Product-Analytics, 10Patch-For-Review: Upgrade SWAP's JupyterLab from beta 1 to beta 2 - https://phabricator.wikimedia.org/T198738 (10Neil_P._Quinn_WMF) >>! In T198738#4395417, @Ottomata wrote: > Since the dependencies are installed in each user's personal virtualenv, you can actually do this...
[19:44:00] joal: yt?
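A small PySpark sketch of the "get into some spark" suggestion, reading the clickstream dumps from the HDFS path joal mentions. The per-wiki file glob under that directory and the gzipped TSV layout are assumptions based on the published dumps (prev, curr, type, n columns per the linked readme), so paths and column names may need adjusting.

```python
# Sketch: load the clickstream TSVs from HDFS and show the most-followed links.
# The file glob and column order are assumptions based on the published dumps.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName('clickstream-peek').getOrCreate()

clickstream = (spark.read
    .option('sep', '\t')
    .csv('/wmf/data/archive/clickstream/*/clickstream-enwiki-*.tsv.gz')
    .toDF('prev', 'curr', 'type', 'n')
    .withColumn('n', col('n').cast('long')))

clickstream.filter(col('type') == 'link') \
    .orderBy(col('n').desc()) \
    .show(20, truncate=False)
```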
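And a rough sketch of milimetric's earlier pagelinks-plus-webrequest suggestion: counting requests whose referer is a given article, reusing the `spark` session from the sketch above. Field names follow the wmf.webrequest Hive table; the example page and date are placeholders, and joining against wmf_raw.mediawiki_pagelinks to keep only true wikilink targets is left out for brevity.

```python
# Rough sketch of counting clicks out of one article via the referer field.
# The page title and date below are placeholders.
outgoing = spark.sql("""
    SELECT uri_path, COUNT(*) AS clicks
    FROM wmf.webrequest
    WHERE webrequest_source = 'text'
      AND year = 2018 AND month = 7 AND day = 2
      AND is_pageview
      AND referer LIKE 'https://en.wikipedia.org/wiki/Example_article%'
    GROUP BY uri_path
    ORDER BY clicks DESC
    LIMIT 50
""")
outgoing.show(50, truncate=False)
```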
[19:44:25] Hi ottomata
[19:44:28] ok great
[19:44:30] Yes I'm here
[19:44:33] want to try some toree stuff
[19:44:39] can you open tunnel to notebook1004?
[19:44:45] I'm with mforns trying to get what's wrong with sanitization
[19:44:47] gotta run some commands and see if some kernels will work
[19:44:50] oh ok
[19:44:52] no hurry!
[19:45:01] will ping back
[19:47:08] OH! you've never logged into notebook1004 before great
[19:47:11] this is an even better test
[20:16:53] ottomata: waiting for some job to work - Can I help?
[20:16:59] YESSSSS lets try
[20:17:07] ssh tunnel to notebook1004
[20:17:21] ssh -N notebook1004.eqiad.wmnet -L 8000:127.0.0.1:8000
[20:17:35] and let's see what kernels you have
[20:17:38] and if they work...
[20:18:19] hmm
[20:18:20] logging in
[20:18:23] OOOO
[20:18:27] ok something won't work...
[20:19:32] ok hold on, we try again in a bit...i could just make it work, but this login part should work
[20:19:37] so lemme see if i can fix that
[20:20:33] np ottomata
[20:25:50] ok joal try again plz
[20:26:32] hmm
[20:26:51] joal what do you see?
[20:26:57] if there is a logout option, or a stop my server
[20:26:58] can you do that?
[20:27:17] I started server --> error 500 :(
[20:27:22] right
[20:27:28] log out if you can
[20:27:31] done
[20:27:36] log back in ?
[20:27:36] ok, try logging back in
[20:27:42] ya
[20:27:53] nope
[20:27:59] ok one more
[20:28:00] can you log out?
[20:28:05] again?
[20:28:16] joal: ?
[20:28:28] sure
[20:28:44] done?
[20:28:45] done
[20:29:36] ok, try logging in again joal
[20:30:28] looking betterrrrr
[20:30:39] do you have any very enticing looking spark kernels?
[20:30:48] Wooow
[20:30:53] Let me see
[20:30:55] try em!
[20:31:05] any?
[20:31:27] sure
[20:31:31] the yarn ones in particular though
[20:31:33] the locals are easy
[20:31:47] ok - Tried a local one first :)
[20:34:19] ottomata: spark-local means, I run spark, but not on the cluster
[20:34:23] I got it now
[20:34:28] Trying spark yarn
[20:34:41] I thought spark-yarn was having the kernel run on yarn :)
[20:34:58] ah no
[20:35:06] stopped working on that, but learned a lot about how that works if we wanted to
[20:35:07] it's just toree
[20:37:13] ottomata: Scala works <3
[20:37:20] woohoo
[20:37:23] with hive yes?
[20:37:29] not tested yet
[20:41:25] hmmm seeing some errors in your logs...
[20:41:32] (Too many open files)
[20:42:02] up
[20:42:11] this time it succeeded
[20:42:30] ottomata: However errors are not nicely reported for scala
[20:42:37] No big deal, but not really helpful
[20:42:54] yeah i think logs are going to be a big problem with this
[20:43:33] hm, i should make a quick link or something for folks to tail their notebook logs
[20:43:52] ottomata: actually this time it worked
[20:44:03] btw, if you run journalctl -f -u jupyter-joal-singleuser
[20:44:05] you can see them
[20:44:36] joal hm ok
[20:44:42] was it just a big hive query that failed?
[20:45:20] yessir
[20:45:28] in local (my bad :(
[20:45:46] ohh
[20:45:49] interesting
[20:45:51] ok then that makes more sense
[20:45:56] joal another thing we should try
[20:45:56] https://github.com/Brunel-Visualization/Brunel/wiki
[20:46:00] i don't fully understand how to try it out
[20:46:08] I've seen that yes
[20:46:10] i have installed the %%brunel magic tho
[20:46:13] before
[20:46:15] that was not hard
[20:46:18] ok
[20:46:28] Will have a look :)
[20:47:22] ok cool! do play around with it.
if we are happy i'll install for all users and then we can announce
[20:47:42] that's super great :)
[20:47:47] Thanks a mil ottomata for that :)
[20:47:54] yaaa it's very cool
[20:47:55] !
[20:48:04] with scala there i'll probably start using notebook way more
[20:49:54] (03PS1) 10Ottomata: Install toree kernels for all users [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443736 (https://phabricator.wikimedia.org/T190443)
[20:50:07] ottomata: If brunel (or vegas) does a good job at graphs, I'll use it too - Otherwise, I'll stick with python I think (sparkDataframe.toPandas() is just magic)
[20:50:19] (03CR) 10Ottomata: [V: 032 C: 032] Install toree kernels for all users [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/443736 (https://phabricator.wikimedia.org/T190443) (owner: 10Ottomata)
[20:50:34] ottomata: tested spark SQL - works like a charm :)
[20:50:54] yeehaw
[20:53:04] ottomata: Do I need the add jar for brunel, or is it done for me?
[20:54:10] you need to add
[20:54:14] Ah ok :)
[20:54:26] i think it will work from the url
[20:54:31] ok trying that
[20:54:33] if not, you can dl manually and copy to your home and do file:// url
[21:00:33] mforns: found --conf spark.ui.retainedJobs as well
[21:00:40] we should set that to 100 or something
[21:01:08] joal, should I add --conf spark.ui.retainedJobs=100 ?
[21:01:19] k
[21:01:28] mforns: If you're launching new big jobs, would be interesting yes
[21:01:34] done
[21:01:50] the march repair job finished with success
[21:01:55] weekly is much better
[21:02:00] mforns: Right :)
[21:02:04] cool
[21:03:44] ottomata: can't manage to have the jar working :(
[21:04:20] Ah, this time it does
[21:05:14] glad i could help!
[21:05:38] mwarf - seems the notebook doesn't like something
[21:05:52] The brunel jar magic never ends
[21:06:51] and I don't have rights to read logs ottomata
[21:08:48] oh
[21:08:51] it is probably not dling
[21:08:56] joal the AddJar you mean?
[21:09:01] ottomata: downloaded it and managed to load
[21:09:04] yessir
[21:09:05] oh
[21:09:16] I think the restart of a toree kernel is heavily slow
[21:09:34] well, it's gotta load the spark session I guess
[21:09:41] so it dled
[21:09:54] nope, downloaded the thing manually
[21:09:59] that's as far as I got with it too, i didn't try more cause their docs weren't very clear to me on how to use it
[21:10:00] ah ok
[21:10:08] Jul 03 21:10:01 notebook1004 bash[1174]: Caused by: org.brunel.model.VisException: Could not find the x field 'projet' while building the visualization: element[1x0]
[21:10:15] joal, you don't have the ability to see those logs?
[21:10:22] nope
[21:10:24] hm
[21:10:26] that's weird
[21:10:33] I see them in the UI, but not with the command
[21:10:57] hm ok, gonna make a task about that
[21:11:01] i think that's gonna be important
[21:12:36] so joal
[21:12:37] you can't see things like
[21:12:38] Jul 03 21:11:45 notebook1004 bash[1174]: org.brunel.model.VisException: Unknown action 'lines(?)' while parsing action text: data('df') lines x(project) y(c)
[21:12:38] ?
[21:13:15] I can in the UI
[21:13:20] oh
[21:13:22] But I have no other access
[21:13:22] they show up there?
[21:13:25] They do !
[21:13:27] oh ok
[21:13:29] that's good at least
[21:13:39] yeah yeah - They show up in the UI
[21:13:59] There has been 1 time when they have not, but since then, everything is fine
[21:19:35] no change with brunel so far
[21:22:37] mforns: did we move the whitelist to refinery source then?
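For reference on the two Spark bits mentioned above: spark.ui.retainedJobs caps how many completed jobs the driver UI keeps in memory (the default is 1000, which can get heavy for long runs), and .toPandas() pulls a small Spark result into pandas for plotting inside a notebook. A sketch with the settings applied programmatically instead of via --conf on spark-submit; the table name is a placeholder.

```python
# Sketch: cap the driver UI's job/stage history (the programmatic equivalent of
# --conf spark.ui.retainedJobs=100) and pull a small aggregate into pandas.
# The event.SomeSchema table name is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName('el-sanitization-debug')
    .config('spark.ui.retainedJobs', 100)
    .config('spark.ui.retainedStages', 100)
    .getOrCreate())

counts = spark.sql("""
    SELECT year, month, COUNT(*) AS events
    FROM event.SomeSchema
    GROUP BY year, month
    ORDER BY year, month
""")

pdf = counts.toPandas()   # fine for small results; this collects to the driver
print(pdf.head())
```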
[21:22:45] mforns: or is that code still in the works
[21:22:50] nuria_, not refinery source, but refinery
[21:22:55] mforns: ahahah
[21:23:17] yes, it's under /static_data/eventlogging/whitelist.yaml
[21:24:19] mforns: right! is the mysql python running off the other list or not running now?
[21:24:40] it is running off the other whitelist that is ensured by puppet
[21:25:03] we need to deploy refinery in the eventlogging hosts
[21:25:14] I thought I had created that task, but I'm not sure
[21:25:59] no, cannot find it in phab
[21:26:05] will create one now
[21:26:26] ottomata: looks like our proxy makes me fail to download stuff from inside the workbook :(
[21:26:37] ottomata: any idea on how to solve?
[21:27:25] 10Analytics: Deploy refinery to eventlogging hosts - https://phabricator.wikimedia.org/T198766 (10mforns)
[21:27:33] nuria_, ^
[21:28:40] we can also include the puppet change that will call the purging script for mysql with the new whitelist path
[21:32:01] updated the task description
[21:41:42] mforns: thank you
[21:42:50] mforns: and you talked with ottomata and he is ok with deploying refinery to EL hosts...
[21:43:09] mforns: also we run whitelist code in beta labs so we would need to update that too
[21:43:14] nuria_, yes we talked about this, right otto?
[21:43:38] I see
[22:05:11] 10Analytics, 10Cassandra, 10RESTBase-Cassandra, 10monitoring, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans) AQS has been partially upgraded: ```name="prometheus-jmx-exporter" aqs1005.eqiad.wmnet: Installed: 1:0.3.0-1 aqs1008...
[22:06:38] 10Analytics, 10Cassandra, 10RESTBase-Cassandra, 10monitoring, and 2 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans)
[22:06:58] 10Analytics, 10Cassandra, 10monitoring, 10Puppet, 10Services (watching): Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans)