[04:12:31] <wikibugs>	 10Analytics, 10Operations, 10netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041596 (10ayounsi) a:03ayounsi 1st change applied.  Waiting for confirmation for the 2nd.
[05:37:16] <wikibugs>	 10Analytics, 10Operations, 10Traffic, 10HTTPS: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041656 (10Tbayer)
[05:40:34] <wikibugs>	 10Analytics, 10Operations, 10Traffic, 10HTTPS: Update documentation for "https" field in X-Analytics - https://phabricator.wikimedia.org/T188807#4041659 (10Tbayer) @BBlack Thanks again! Back to the task at hand: I have tentatively updated the documentation based on my understanding of your remarks: https:/...
[07:07:18] <elukey>	 morning people!
[07:07:19] <elukey>	 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&orgId=1&from=now-12h&to=now&panelId=42&fullscreen&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All
[07:08:02] <elukey>	 we have been seeing a ton of reqs to druid public brokers from ~4:30 AM
[07:20:32] <elukey>	 from the logs it seems that somebody is using the aqs edit api for big chunks of history
[07:51:19] <elukey>	 added the "edited-pages" graph to aqs-elukey
[07:51:21] <elukey>	 https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&panelId=21&fullscreen
[07:51:48] <elukey>	 so we can see spikes in those metrics too
[07:52:12] <elukey>	 top by edits 2xx seems to be what changed at 4:30 am
[07:58:11] <elukey>	 now one thing that I've never done is tracking the external IP making requests
[07:58:22] <elukey>	 of course on druid it tells me aqs, and on aqs restbase
[08:04:29] <joal>	 Hi elukey 
[08:04:57] <joal>	 elukey: We've not experienced this kind of traffic yet - this is interesting
[08:05:08] <elukey>	 hey joal 
[08:05:26] <elukey>	 very nice to see how cache grows for brokers and then eviction kicks in
[08:06:32] <joal>	 yup
[08:06:48] <joal>	 elukey: no error from aqs not druid for tongiht traffic?
[08:10:05] <elukey>	 not that I can see
[08:10:22] <elukey>	 the last stream of aqs errors on icinga was yesterday at around midnight utc
[08:12:01] <joal>	 elukey: I can't even see it
[08:14:02] <wikibugs>	 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4041789 (10JAllemandou) @diego : Thanks :) I push for productionization to be one of our priorities, but there is a fight for spots in the prioritized items ;)
[08:15:09] <elukey>	 joal: should be https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&orgId=1&from=1520719048809&to=1520736408635&panelId=42&fullscreen&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=druid_public&var-druid_datasource=All
[08:15:36] <joal>	 Oh right, it was on the 10th
[08:15:56] <joal>	 sorry didn't get the date correct
[08:17:02] <joal>	 elukey: from log size, the 10th have been a big day for druid
[08:18:14] <elukey>	 weird that I can't see it from the aqs graph
[08:18:28] <joal>	 elukey: I have the same wonder
[08:18:41] <joal>	 elukey: It think it might be related to reindex-day
[08:19:47] <elukey>	 so on the 10th ~22:40 there was a big spike for aqs
[08:19:48] <elukey>	 https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1520717902632&to=1520725321361&panelId=21&fullscreen
[08:20:45] <elukey>	 I am surely missing some metrics in this graph
[08:24:28] <elukey>	 (there is also an issue with cache text in esams now, lovely morning)
[08:24:52] <joal>	 Yay - aqs can wasit elukey 
[08:25:32] <elukey>	 a lot of people are checking it, so I can watch both :)
[08:25:54] <elukey>	 atm I am trying to think if there is a way to see who is requesting data to restbase -> aqs -> druid
[08:26:44] <joal>	 elukey: check webrequest !
[08:27:47] <elukey>	 it might be our best bet yes
[08:28:34] <elukey>	 joal: can you run a query (you'll be faster and more precise than me) to figure out who's requesting top by edits ?
[08:29:37] <joal>	 elukey: Yes sir- time interval?
[08:30:10] <elukey>	 basically anytime between 4:31 UTC to now
[08:30:18] <joal>	 k
[08:30:45] <elukey>	 also on oxygen there should be sampled logs
[08:33:21] <elukey>	 joal: how does the path for these api reqs looks like in a webrequest?
[08:34:25] <joal>	 elukey: hive seems to have an issue
[08:35:22] <joal>	 elukey: sorry, usual PBCAK
[08:35:31] <joal>	 elukey: will have data soon
[08:35:57] <elukey>	 ahhahaah
[08:36:25] <elukey>	 joal: whenever you have time let me know the path that your are looking for so I can also check the samples webreq logs
[08:36:33] <elukey>	 as they come in to oxygen
[08:36:39] <joal>	 elukey: /api/rest_v1/metrics/edited-pages/top-by-edits
[08:36:48] <elukey>	 thanks
[08:37:12] <joal>	 elukey: hive gives me no result for hour 6
[08:37:13] <joal>	 hm
[08:37:44] <elukey>	 tailing from oxygen doesn't return anything too
[08:38:47] <joal>	 elukey: Given the sampling, I'm not too much surprised
[08:39:05] <joal>	 elukey: However I did another PBCAK ... Maybe I'll have data at some point 
[08:39:25] * joal will have some more coffee
[08:39:45] <joal>	 elukey: I haz dataz
[08:39:51] <joal>	 Single IP
[08:50:47] <elukey>	 !log fixed evenglog1002's ipv6 (https://gerrit.wikimedia.org/r/#/c/418714/)
[08:50:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:52:08] <elukey>	 nice, now rsync eventlog1002 -> stat1005 seems to work
[09:05:51] <wikibugs>	 10Analytics, 10Operations, 10netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041924 (10elukey)
[09:12:38] <wikibugs>	 10Analytics, 10Operations, 10netops: Replace eventlog1001's IP with eventlog1002's in analytics-in4 - https://phabricator.wikimedia.org/T189408#4041928 (10elukey) Since we are doing some cleanups, I'd also like to review the following:  ``` term mysql {     from {         destination-address {             10...
[09:13:09] <wikibugs>	 10Analytics, 10Operations, 10netops: Review some IPs in the analytics-in4 filter - https://phabricator.wikimedia.org/T189408#4041932 (10elukey)
[09:49:26] <wikibugs>	 10Analytics-Kanban: Correct mediawiki-reduced loading job - https://phabricator.wikimedia.org/T189448#4042079 (10JAllemandou)
[09:49:45] <wikibugs>	 10Analytics-Kanban: Correct mediawiki-reduced loading job - https://phabricator.wikimedia.org/T189448#4042090 (10JAllemandou)
[09:50:14] <wikibugs>	 (03PS2) 10Joal: Correct oozie mediawiki-reduced job dependency [analytics/refinery] - 10https://gerrit.wikimedia.org/r/417994 (https://phabricator.wikimedia.org/T189448)
[09:51:42] <wikibugs>	 10Analytics-Kanban: Improve mediwiki-history performan - https://phabricator.wikimedia.org/T189449#4042107 (10JAllemandou) a:03JAllemandou
[09:51:54] <wikibugs>	 10Analytics-Kanban: Improve mediwiki-history performance - https://phabricator.wikimedia.org/T189449#4042096 (10JAllemandou)
[09:55:11] <wikibugs>	 (03CR) 10Joal: "Single comment inline, but not preventing from moving forward. Super good solution." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332) (owner: 10Ottomata)
[10:21:04] <elukey>	 joal: as fyi there was a problem with mirror maker main-eqiad->jumbo
[10:21:12] <elukey>	 I had to restart mirror maker on kafka1020
[10:21:26] <elukey>	 apparently all consumers in the group where getting no partitions assigned
[10:21:33] <joal>	 :(
[10:21:37] <elukey>	 so nobody was producing :(
[10:21:44] <elukey>	 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&from=now-24h&to=now&var-instance=main-eqiad_to_jumbo-eqiad
[10:22:07] <joal>	 Since yesterday 20:00 :(
[10:22:20] <joal>	 no alarms on that - not super nice
[10:22:36] <joal>	 Have we changed anything on kafka yesterday?
[10:22:47] <elukey>	 nothing that I know
[10:23:13] <elukey>	 I agree that we need alarming, I think that we were going to add them this week but this happened yesterday
[10:34:12] <elukey>	 joal: should we file a pull req to lower down the rate limit for druid to 50rps?
[10:34:24] <elukey>	 (iirc it is per ip)
[10:34:37] <joal>	 elukey: it is per IP - I wonder if it is needed though
[10:35:48] <elukey>	 I think so, I'd also lower it down more, this is only one client and I don't love how latency metrics changed
[10:36:57] <elukey>	 GC time for historical is horrible now :D
[11:32:16] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10User-Elukey: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464#4042527 (10elukey) p:05Triage>03High
[11:33:21] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10User-Elukey: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464#4042527 (10elukey) @Gehel found the problem this morning while trying to switch wqds back to Jumbo. He the rolled back and opened https://ph...
[11:34:39] <elukey>	 ok so we have currently two pending issues:
[11:35:22] <elukey>	 1) Druid's public endpoint seems to be consumed by a bot and we should lower down our rate limit to 50rps/ip or even lower
[11:35:39] <elukey>	 2) Mirror Maker is not behaving/replicating, started from yesterday
[11:35:44] <elukey>	 I opened a task about 2)
[11:36:34] <elukey>	 going to be afk for ~2h for lunch + errand but ping me if I am needed
[11:36:48] <elukey>	 when I am back I am planning to work on both with Joseph/Andrew
[12:19:09] <wikibugs>	 (03PS27) 10Joal: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns)
[12:56:17] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4042934 (10Joe) I'd take the chance we have to do this to do as follows:  # Add `mediawiki::multimedia` to the jobrunners # Add a seco...
[13:25:07] <ottomata>	 yoohooo
[13:26:12] <wikibugs>	 (03CR) 10Ottomata: Get smart and hacky about compatible type casting (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332) (owner: 10Ottomata)
[13:26:50] <wikibugs>	 (03PS2) 10Ottomata: Get smart and hacky about compatible type casting [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332)
[13:29:29] <ottomata>	 joal:  am testing a couple of things, you think we can do a refinery source release in a min?
[13:33:42] <joal>	 Hey ottomata 
[13:33:56] <joal>	 ottomata: We an go for refinery-source when you wish
[13:35:15] <ottomata>	 cooool i haven't been following the recent changes you and mforns have been doing to whitelist, but I say +1 to wahtever yall are doing
[13:35:23] <ottomata>	 so merge that at will, maybe add an entry to changelog about it
[13:35:23] <joal>	 :)
[13:38:30] <wikibugs>	 (03PS28) 10Joal: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns)
[13:38:37] <joal>	 ottomata: Just pushed a patch including changelog modif
[13:38:47] <joal>	 ottomata: +1?
[13:39:21] <ottomata>	 joal:  +1 but derby.log :)
[13:39:29] <joal>	 ottomata: Ah !! Crap
[13:40:12] <wikibugs>	 (03PS29) 10Joal: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns)
[13:40:24] <elukey>	 I am back :)
[13:40:29] <elukey>	 ottomata: o/
[13:40:43] <joal>	 Hi again elukey 
[13:40:50] <ottomata>	 hiii
[13:43:32] <elukey>	 ottomata: did you make any change to mirror maker? (restart etc.. ?)
[13:43:36] <elukey>	 main-eqiad -> humbo
[13:43:37] <elukey>	 hahaha
[13:43:39] <elukey>	 jumbo
[13:44:01] <elukey>	 it works now but with that weird spiky pattern
[13:44:20] <ottomata>	 elukey:  not since i rolled back the mw monolog producer last week
[13:44:55] <elukey>	 ahhh okok so there is a task in your emails then :)
[13:45:48] <ottomata>	 ok :)
[13:45:53] <ottomata>	 hmm, joal we might have a problem:
[13:46:03] <ottomata>	 ls -lh /mnt/hdfs/wmf/data/event/SearchSatisfaction/year=2018/month=3/day=5/hour=2
[13:46:08] <ottomata>	 vs
[13:46:18] <ottomata>	 ls -lh /mnt/hdfs/wmf/data/event/SearchSatisfaction/year=2018/month=3/day=5/hour=1
[13:46:20] <ottomata>	 file sizes
[13:46:26] <ottomata>	 hour=2 is with the new code
[13:46:42] <ottomata>	 i wonder if this is beacuse it iterates over all the row rdd somehow?
[13:47:04] <ottomata>	 (also no _REFINED flag, not sure why, checking)
[13:48:07] <joal>	 ottomata: could be because of the number of files
[13:48:18] <joal>	 I don't why, but hour=2 has 200 files
[13:49:23] <ottomata>	 yeah, but i just refined hour=2 with the new code
[13:49:26] <ottomata>	 hour=1 was done with the old code
[13:49:38] <ottomata>	 so, the new code creates more smaller files
[13:50:51] <joal>	 ottomata: the new code repartition the data, and therefore writes a lot more smaller files
[13:51:12] <joal>	 I think 200 is the default number of partitions for spark-SQL related commands
[13:52:16] <ottomata>	 ohhh
[13:52:17] <ottomata>	 ok
[13:52:22] <ottomata>	 so it this acceptable you think?
[13:52:36] <joal>	 ottomata: Nope
[13:53:10] <joal>	 ottomata: hour 1 has 3 small files, so 200 makes them too small (and too many)
[13:54:04] <joal>	 ottomata: 150kb avg for hour 2, 1Mb, 5Mb and 5Mb for hour 1
[13:54:32] <joal>	 ottomata: Where do we use SQL-related commands in the code?
[13:56:21] <ottomata>	 yes agree
[13:56:45] <ottomata>	 joal:  when writing the data
[13:56:55] <ottomata>	 well
[13:56:56] <ottomata>	 maybe?
[13:56:57] <ottomata>	 insertInto?
[13:56:58] <ottomata>	 outputDf.write.mode("overwrite")
[13:56:58] <ottomata>	     .partitionBy(partitionNames:_*)
[13:56:58] <ottomata>	     .insertInto(partition.tableName)
[13:57:10] <joal>	 ottomata: This has not changed, has-it?
[13:57:11] <ottomata>	 but, we did that before too
[13:57:12] <ottomata>	 right
[13:57:14] <joal>	 yeah
[13:57:17] <ottomata>	 what has changed is convertDataFarme
[13:57:21] <joal>	 hm
[13:57:23] <ottomata>	 before, we re-read the json with the schema
[13:57:26] <ottomata>	 and then wrote that dataframe
[13:57:46] <ottomata>	 convertToSchema*
[13:57:56] <joal>	 ok
[13:58:00] * fdans DST is stupid
[13:58:02] <ottomata>	 maybe convRdd has too many partitions?
[13:58:40] <joal>	 ottomata: I think convertRdd makes use of DataFrame API, which by default set the number of partitions to 200
[13:59:04] <joal>	 ottomata: 2 approaches: remove DataFrame API calls - Or preset number of partitions for SQL
[13:59:13] <joal>	 ottomata: Is the code merged
[13:59:15] <joal>	 ?
[13:59:20] <ottomata>	 no]
[13:59:21] <ottomata>	 oh
[13:59:25] <ottomata>	 convert is
[13:59:26] <ottomata>	 the casting isn't
[13:59:33] <ottomata>	 let's merge the casting, it works
[13:59:36] <ottomata>	 i refined some data that failed before
[13:59:40] <joal>	 awesome
[13:59:45] <ottomata>	 so i'm going to merge that, i also have a fix for a string thing
[13:59:47] <ottomata>	 and then lets fix this
[13:59:50] <ottomata>	 repartition?
[13:59:56] <joal>	 yessir
[14:00:08] <ottomata>	 joal +1 on https://gerrit.wikimedia.org/r/#/c/418052/
[14:00:10] <joal>	 I think whitelisting is ready (I did not test, but Marcel did)
[14:00:14] <ottomata>	 ok cool
[14:00:15] <ottomata>	 let's merge it too
[14:00:18] <joal>	 yes
[14:00:29] <ottomata>	 i merge whitelist you merge mine 
[14:00:31] <ottomata>	 :)
[14:00:33] <joal>	 :)
[14:00:44] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns)
[14:00:56] <wikibugs>	 (03CR) 10Joal: [C: 032] Get smart and hacky about compatible type casting [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332) (owner: 10Ottomata)
[14:01:19] <joal>	 Ok, now let's wait for jenkins o do its job
[14:01:28] <joal>	 And let's debug hat repartition thing
[14:02:54] <joal>	 hm ottomata - where is that patch with convert?
[14:03:07] <wikibugs>	 (03PS3) 10Ottomata: Get smart and hacky about compatible type casting [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418052 (https://phabricator.wikimedia.org/T189332)
[14:03:17] <wikibugs>	 (03PS1) 10Ottomata: Fix hivePartitionPath to print proper partition Location [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418916
[14:06:43] <ottomata>	 joal this one? https://gerrit.wikimedia.org/r/#/c/410942/
[14:07:59] <ottomata>	 this is code that added convert function https://gerrit.wikimedia.org/r/#/c/410241/6/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/jsonrefine/SparkSQLHiveExtensions.scala
[14:10:47] <joal>	 Oh ! It's been merged
[14:10:54] <joal>	 ok, reading agian
[14:11:54] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10User-Elukey: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464#4043165 (10Gehel)
[14:15:57] <elukey>	 very nice that https://github.com/ben-manes/caffeine is suggested for recent versions of druid rather than the local cache
[14:15:58] <joal>	 ottomata: I think the repartitioning occurs when applying transform functions
[14:17:16] <joal>	 elukey: I didn't of caffeine !
[14:17:36] <ottomata>	 Oh joal intersting
[14:17:37] <ottomata>	 trying
[14:17:38] <joal>	 elukey: Now we have j8, I'll patch our LRUCache in refinery-source for a caffeine one :)
[14:17:55] <joal>	 ottomata: those transform functions make use of DF api
[14:19:36] <elukey>	 what I currently don't like is the GC timings for the historical
[14:19:44] <elukey>	 because they spikes to seconds
[14:20:10] <joal>	 elukey: Right
[14:20:13] <elukey>	 (young gen if I am seeing correctly)
[14:20:38] <joal>	 In term of load, it's not that bad
[14:20:41] <elukey>	 also, we have set broker cache to ~2G of heap size but metrics show ~4 ?
[14:20:44] <joal>	 elukey: --^ https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=druid100%5B456%5D&var-network=eth0&from=now-12h&to=now&refresh=1m
[14:20:56] <joal>	 elukey: have we restarted since?
[14:21:03] <ottomata>	 joal:  you are right, without transform funcs, 3 files.
[14:21:05] <ottomata>	 o
[14:21:06] <ottomata>	 ok
[14:21:08] <ottomata>	 so, what can we do?
[14:21:15] <joal>	 ottomata: I have a patch on the way :)
[14:21:44] <elukey>	 joal: restarted ?
[14:21:56] <joal>	 since the patch for 4G
[14:22:00] <ottomata>	 :)
[14:22:32] <elukey>	 joal: I am pretty sure we did
[14:22:52] <elukey>	 and we haven't patched to 4G
[14:23:09] <elukey>	   # Increase druid broker query cache size to 2G.
[14:23:09] <elukey>	   # TBD: Perhaps we should also try using memcached?
[14:23:09] <elukey>	   druid.cache.sizeInBytes: 2147483648
[14:23:41] <ottomata>	 joal
[14:23:52] <ottomata>	 fyi, it is specifically the deduplicate function that does it
[14:23:59] <ottomata>	 dropDuplicates
[14:24:10] <ottomata>	 joal:  i could get num partitions before
[14:24:14] <ottomata>	 and then repartition after drop duplicates back to orig
[14:24:21] <ottomata>	 not sure what you are gonna do :p
[14:24:26] <ottomata>	 maybe that ^ ?
[14:25:23] <joal>	 to the point ottomata :)
[14:25:25] <wikibugs>	 (03PS1) 10Joal: Update Refine partitioning strategy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418919
[14:25:28] <joal>	 ottomata: --^
[14:26:05] <ottomata>	 nice joal :)
[14:26:21] <joal>	 ottomata: Since the point of transform functions is to provide extendable code, let's not enforce RDD only API :)
[14:26:33] <ottomata>	 waiting for jenkins, then will merge :)
[14:26:41] <joal>	 cool
[14:27:49] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Fix hivePartitionPath to print proper partition Location [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418916 (owner: 10Ottomata)
[14:28:07] <ottomata>	 yargghhh joal i have meetings today from standup +4 hours!  
[14:28:14] <joal>	 crap
[14:28:18] <joal>	 hm
[14:28:22] <elukey>	 2018-03-10T22:38:52,481 WARN org.eclipse.jetty.server.ServerConnector:
[14:28:25] <elukey>	 java.io.IOException: Too many open files
[14:28:25] <joal>	 ottomata: Let's merge and deploy now
[14:28:28] <elukey>	 lovely
[14:28:37] <joal>	 elukey: druid?A
[14:29:12] <elukey>	 druid1004's broker
[14:29:16] <elukey>	 so the processs has Max open files            4096                 4096                 files
[14:29:29] <ottomata>	 joal ok, but i kinda want to be ready when json refine changes,... the puppet cron uses the unversioned symlink
[14:29:32] <ottomata>	 hm
[14:29:38] <ottomata>	 i can make a patch to use versioned
[14:29:43] <ottomata>	 everything else uses versioned, right?
[14:30:06] <wikibugs>	 (03CR) 10Ottomata: [C: 032] Update Refine partitioning strategy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/418919 (owner: 10Joal)
[14:30:11] <joal>	 ottomata: hm, I don't think so 
[14:30:25] <joal>	 ottomata: straming job was using simlink, I think camus is as well
[14:30:43] <elukey>	 joal: ah no wait for some reason the logs stopped at 2018-03-10T22:38:52
[14:31:03] <ottomata>	 joal:  but those ones are unchnaed, right?
[14:31:06] <joal>	 elukey: probably because of that specific error :)
[14:31:15] <joal>	 unchaned?
[14:32:28] <elukey>	 !log restart druid-broker on druid1004 - no /var/log/druid/broker.log after 2018-03-10T22:38:52 (java.io.IOException: Too many open files_
[14:32:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:33:24] <joal>	 elukey: I don't understand that issue thoug
[14:33:56] <joal>	 elukey: from what I read, druid uses files when doing group-by, which we normally don't do
[14:34:15] <joal>	 elukey: maybe the request that broke it didn't come from AQS but from superset?
[14:34:35] <elukey>	 on druid1005 it does this
[14:34:36] <elukey>	 2018-03-12T14:26:57,097 WARN org.eclipse.jetty.server.HttpChannel: Could not send response error 500: java.io.IOException: Stream closed
[14:34:38] <wikibugs>	 (03PS1) 10Cicalese: Added queries of PHP version by MediaWiki version. [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/418922
[14:34:39] <elukey>	 2018-03-12T14:26:58,736 WARN org.eclipse.jetty.servlet.ServletHandler: /druid/v2/
[14:34:42] <elukey>	 java.io.IOException: Stream closed
[14:34:44] <elukey>	 all different :D
[14:34:57] <joal>	 yay, awesomeness of distributed system
[14:35:05] <ottomata>	 joal:  starting to do release
[14:36:26] <joal>	 ottomata: let me know if you wish me to help
[14:36:51] <joal>	 ottomata: also, I need to provide a patch for refinery before deploy (adding new jar version for mobile-apps session job)
[14:36:52] <ottomata>	 https://integration.wikimedia.org/ci/job/analytics-refinery-release/95/
[14:36:53] <elukey>	 joal: sockets are also counted under open files no?
[14:37:00] <ottomata>	 ok
[14:37:12] <joal>	 elukey: very possible
[14:37:16] <ottomata>	 joal:  perhaps I let you deploy refinery
[14:37:25] <joal>	 ottomata: works for me
[14:38:37] <joal>	 ottomata: we already have v0.0.58
[14:38:52] <joal>	 ottomata: have you forced a rebuild before feleasing?
[14:39:59] <joal>	 Ah ottomata - My mistake, I read the first commit of the list, not the provided version
[14:43:23] <elukey>	 other interesting thing:
[14:43:25] <elukey>	 elukey@druid1006:/var/log/druid$ sudo du -hs /var/lib/druid/segment-cache/*
[14:43:28] <elukey>	 2.3M	/var/lib/druid/segment-cache/info_dir
[14:43:31] <elukey>	 157G	/var/lib/druid/segment-cache/mediawiki_history_reduced
[14:43:37] <elukey>	 max size of the segment cache is 2T
[14:47:22] <joal>	 elukey: I think I don't understand :(
[14:48:12] <ottomata>	 joal:  release finished
[14:48:32] <wikibugs>	 (03PS1) 10Joal: Update mobile-app session job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/418925 (https://phabricator.wikimedia.org/T184768)
[14:48:38] <joal>	 ottomata: --^ if you don' mind
[14:49:43] <elukey>	 joal: in theory we have on every druid host a separate partition that has ~2.4T of free space, that is mounted as /var/lib/druid
[14:49:57] <joal>	 ok
[14:50:08] <elukey>	 joal: in there there should be our segment cache, that we set maximum at ~2.4T
[14:50:31] <elukey>	 buuut we use only 150G - is it because mw history is not that big
[14:50:38] <joal>	 elukey: correct
[14:50:41] <elukey>	 or because for some reason we are not caching a lot?
[14:50:52] <joal>	 The reduced data is no that big
[14:51:06] <elukey>	 do we know how much are all the segments ?
[14:51:24] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Update mobile-app session job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/418925 (https://phabricator.wikimedia.org/T184768) (owner: 10Joal)
[14:51:25] <joal>	 well, 150Gb of very condensed data in columnar storage is actually pretty big, but not in size :)
[14:51:26] <ottomata>	 merged joal
[14:51:31] <joal>	 Thanks ottomata 
[14:53:01] <joal>	 ottomata: we need to add the jars to refinery - Do you launch jenkins?
[14:54:54] <ottomata>	 oh
[14:55:02] <ottomata>	 oh forgot that step
[14:55:02] <ottomata>	 doing
[14:58:05] <wikibugs>	 10Analytics: pyspark2 different versions in Driver and Workers - https://phabricator.wikimedia.org/T189497#4043377 (10diego)
[14:59:11] <joal>	 ottomata: are you ready for me to deploy refinery (therefore the jars and symlinks) ?
[14:59:39] <ottomata>	 https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/66/ yes!
[14:59:54] <ottomata>	 	1.	Add refinery-source jars for v0.0.58 to artifacts (https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/66/changes#detail0)
[14:59:54] <ottomata>	  ?
[15:00:36] <ottomata>	 ok it did it
[15:00:38] <ottomata>	 yeah joal am ready
[15:00:38] <ottomata>	 do it
[15:00:43] <joal>	 ok ottomata 
[15:03:20] <ottomata>	 a-team standup!
[15:03:33] <fdans>	 goddammit
[15:03:46] <fdans>	 sorry
[15:05:07] <ottomata>	 milimetric: ?
[15:05:26] <ottomata>	 oh you are off?
[15:07:22] <joal>	 !log Deploy refinery from scap
[15:07:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:09:09] <joal>	 !log Deploy refinery onto hdfs
[15:09:10] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:09:38] <wikibugs>	 10Analytics-Tech-community-metrics, 10Developer-Relations (Apr-Jun-2018): Explain decrease in number of patchset authors for same time span when accessed 3 months later - https://phabricator.wikimedia.org/T184427#4043440 (10Aklapper)
[15:10:42] <joal>	 ottomata: refiner deployed - jars are new
[15:11:44] <wikibugs>	 (03PS3) 10Joal: Correct oozie mediawiki-reduced job dependency [analytics/refinery] - 10https://gerrit.wikimedia.org/r/417994 (https://phabricator.wikimedia.org/T189448)
[15:12:08] <ottomata>	 gr8
[15:18:45] <wikibugs>	 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Check detached accounts in DB with same username for "mediawiki" and "phab" sources but different uuid's (and merge if connected) - https://phabricator.wikimedia.org/T170091#4043473 (10Aklapper)
[15:25:32] <joal>	 ottomata: I messed up my patch with mobile_apps
[15:25:36] <joal>	 fixing now
[15:26:51] <wikibugs>	 (03PS1) 10Joal: Fix bug in mobile_apps session job conf [analytics/refinery] - 10https://gerrit.wikimedia.org/r/418931
[15:26:55] <joal>	 ottomata: --^
[15:37:27] <wikibugs>	 (03CR) 10Ottomata: [V: 032 C: 032] Fix bug in mobile_apps session job conf [analytics/refinery] - 10https://gerrit.wikimedia.org/r/418931 (owner: 10Joal)
[15:46:29] <mforns>	 heeey team :] sorry for being late, I missed the DST change and thought meetings were 1 hour later...
[15:46:59] <mforns>	 in spain only in 2 weeks...
[15:48:56] <elukey>	 joal: https://github.com/wikimedia/restbase/pull/965
[15:49:06] <elukey>	 mforns too --^
[15:56:58] <mforns>	 elukey, left a comment cause I think there's one limit that slipped, but makes sense and looks good!
[15:59:19] <elukey>	 checking!
[16:21:47] <wikibugs>	 (03PS4) 10Joal: Correct oozie mediawiki-reduced job dependency [analytics/refinery] - 10https://gerrit.wikimedia.org/r/417994 (https://phabricator.wikimedia.org/T189448)
[16:22:06] <wikibugs>	 (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/417994 (https://phabricator.wikimedia.org/T189448) (owner: 10Joal)
[16:26:11] <joal>	 !log Deploying refinery again to provide patch for mobile_apps_session_metric job
[16:26:13] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:28:49] <dsaez>	 hi joal: can a I make you a quick question about the parquet dumps?
[16:28:55] <joal>	 sure dsaez 
[16:29:39] <dsaez>	 I'm trying to get the timestamp for each revision. I'm doing a query on hive, and then try to join, but I get a memory error
[16:30:39] <dsaez>	 makes sense given that there are around 700M revisions
[16:30:56] <joal>	 dsaez: the question is more about: what do you do wih those?
[16:31:44] <dsaez>	 I run a regex on all the revisions (amazingly fast in spark) 
[16:32:21] <dsaez>	 But there is no timestamp 
[16:32:42] <dsaez>	 I need the timestamp. Because I'm creating a temporal link graph
[16:32:43] <joal>	 dsaez: Indeed I have removed timestamp from the dumps - that was a mistake!
[16:33:29] <joal>	 dsaez: joining between your subset of revisions already processed and mediawiki-history is what I'd do
[16:34:11] <joal>	 dsaez: In the meantime I'm gonna update the XMLConverter to add timestamp to it
[16:34:15] <dsaez>	 I'm doing that, but getting the memory error java.lang.OutOfMemoryError: Java heap space
[16:34:24] <joal>	 dsaez: spark?
[16:34:30] <dsaez>	 joal: that sounds great
[16:34:34] <dsaez>	 yes
[16:34:38] <dsaez>	 pyspark
[16:34:39] <joal>	 hm
[16:35:21] <joal>	 ok a few hints here and there dsaez - We've done some experiments (well, the discovery team did ;) - For those kind of big stuff using scala-spark is really better
[16:35:37] <joal>	 we've an improvement of 10x between pyspark and scala-spark depending of the work
[16:36:02] <joal>	 Also, the question is: how much memory o you give to your workers?
[16:36:05] <joal>	 and driver?
[16:36:16] <dsaez>	 joal: I now that scala is much more efficient, but basically I'm doing one-time usage code, and also I should share this with the research community  and they use python
[16:36:44] <dsaez>	 joal: I'm not explicitly given the memory, because this: https://phabricator.wikimedia.org/T189497
[16:37:03] <dsaez>	 I'm just running pypsark --master yarn
[16:38:13] <joal>	 dsaez: you run spark from a notebook?
[16:38:31] <dsaez>	 joal: yes!
[16:38:51] <dsaez>	 no good idea?
[16:38:54] <joal>	 dsaez: And from the ticket, you run spark2
[16:39:03] <dsaez>	 joal: yes
[16:39:14] <joal>	 dsaez: just making sure I'm up to date on settings
[16:39:25] <joal>	 dsaez: I never use pyspark, so I can't really say
[16:39:39] <joal>	 For OOM in spark, giving it more memory is the way to go
[16:40:04] <dsaez>	 I see
[16:40:18] <joal>	 And since the request is about a huge join, I'd say give more memory to your master, and some more memory-overhead
[16:40:27] <joal>	 s/master/driver
[16:41:05] <dsaez>	 joal: ok, but when declaring the workers, I got that version problem
[16:41:51] <joal>	 dsaez: It's actually super bizarre that it doesn't fail without executor-memory setting if python versions are different
[16:42:24] <dsaez>	 joal: maybe is all working the on the driver?
[16:42:45] <joal>	 dsaez: I don't think so
[16:42:56] <joal>	 dsaez: Or the data you're processing is small enough
[16:43:30] <dsaez>	 joal: the full dump enwiki 2018-01
[16:43:35] <dsaez>	 and goes pretty fast
[16:44:10] <joal>	 dsaez: do you have a spark instance currently running/
[16:44:11] <joal>	 ?
[16:44:17] <joal>	 From the cluster, I'd say no
[16:44:26] <dsaez>	 joal: I've just stop
[16:44:38] <joal>	 dsaez: Can you start it again please?
[16:44:40] <dsaez>	 sure
[16:44:44] <joal>	 Not the job, the spark app
[16:44:47] <joal>	 :)
[16:45:07] <dsaez>	 joal: done, this is the way that I get no error
[16:45:36] <joal>	 dsaez: ok, so you're running in non-distributed mode
[16:45:43] <joal>	 dsaez: https://yarn.wikimedia.org/cluster/scheduler
[16:45:47] <dsaez>	 ok, that's why
[16:46:03] <dsaez>	 I'll try to install a python2.7 virtualenv
[16:46:22] <dsaez>	 joal: ok, I'll run now In the way that I get the error
[16:46:26] <dsaez>	 the version error
[16:46:44] <joal>	 dsaez: try without executor-memory, but with --master yarn only
[16:47:03] <dsaez>	 joal: ok, done
[16:47:21] <joal>	 dsaez: no yarn job
[16:47:31] <joal>	 dsaez: From which machine do you launch it?
[16:47:44] <wikibugs>	 10Analytics: pyspark2 different versions in Driver and Workers - https://phabricator.wikimedia.org/T189497#4043377 (10Ottomata) Ah, python3 is installed!  I think the issue is that pyspark2 from stat1005 is stretch with Python 3.5, but most workers are Jessie with Python 3.4.  We're in the process of adding more...
[16:47:47] <dsaez>	 stat1005
[16:48:13] <dsaez>	 joal: now is running
[16:48:40] <joal>	 dsaez: no yarn job
[16:48:56] <dsaez>	 ¿?
[16:49:02] <dsaez>	 let's do it again
[16:49:06] <joal>	 dsaez: How do you launch spark?
[16:49:11] <joal>	 from a terminal?
[16:49:18] <elukey>	 the druid bot stopped in the meantime
[16:49:26] <elukey>	 but I'll push anyway for the rate limit
[16:49:28] <joal>	 :)
[16:49:38] <dsaez>	 joal> pyspark2  --master yarn 
[16:49:48] <joal>	 elukey: afer having downloaded the full history of top pages per day, it finally stopped :)
[16:49:54] <dsaez>	 joal> what about now?
[16:50:24] <joal>	 nope dsaez - nothing
[16:50:35] <joal>	 trying myself dsaez 
[16:51:22] <joal>	 dsaez: look at the yarn UI (link above)
[16:52:01] <joal>	 I exectly did what you did dsaez: pyspark2 --master yarn
[16:52:25] <ottomata>	 (joal i added to analytics-goals etherpad: 		Spark 2, deprecate and remove spark 1.6 from cluster.)
[16:52:32] <ottomata>	 for next quarter :)
[16:52:34] <joal>	 !log Deploying refinery on HDFS for mobile_apps patch
[16:52:35] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:44] <joal>	 ottomata: Many thanks :)
[16:52:45] <dsaez>	 joal: application_1520532368078_12129	dsaez
[16:53:01] <joal>	 Here we go dsaez :)
[16:53:10] <joal>	 dsaez: what have you changed?
[16:53:26] <dsaez>	 I've just cleaned the cookies of my browser
[16:53:49] <dsaez>	 apparently, when the jupyter notebook restart the kernel (when reconnect) kills the spark process
[16:54:07] <joal>	 Ah
[16:54:14] <joal>	 That's weird :)
[16:54:38] <dsaez>	 joal: now, I'll execute a udf ...let's see if gives the error
[16:54:55] <joal>	 However dsaez - If this is running that way (disributed etc) you should be able to provide some conf settings
[16:55:25] <dsaez>	 joal: now I get the error Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
[16:55:33] <joal>	 !log Restart mobile_apps_session_metrics
[16:55:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:56:13] <dsaez>	 joal: i see to possible solutions, install python3 on the workers, or that I create a virtualenv with python2.7 on the stat1005
[16:56:59] <dsaez>	 joal: let me try the last one, that it would be the easiest 
[16:57:10] <ottomata>	 dsaez: see comment on your ticket
[16:57:12] <ottomata>	 python 3 is installed
[16:57:14] <ottomata>	 but python 3.4
[16:57:15] <ottomata>	 but
[16:57:23] <ottomata>	 try running from stat1004
[16:57:27] <ottomata>	 which has python3.4 too
[16:57:37] <ottomata>	 but, i think that is not the problem, since it says python version is 2.7
[16:57:43] <ottomata>	 i'm not sure how PYSPARK_PYTHON works
[16:57:54] <dsaez>	 ottomata, let me check
[16:58:04] <ottomata>	 but maybe by setting it to jupyter, you are causing workers to use python 2.7 
[16:58:09] <ottomata>	 but your driver is using 3
[16:58:59] <dsaez>	 ottomata, I'll check the jupyter documentation ... but don't think that's possible
[16:59:18] <ottomata>	 well Python in worker has different version 2.7 seems to indicate that worker is running python 2.7
[16:59:20] <ottomata>	 dunno why though
[17:00:04] <elukey>	 https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
[17:00:31] <wikibugs>	 (03CR) 10Mforns: [C: 032] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/413689 (owner: 10Joal)
[17:00:55] <wikibugs>	 (03PS2) 10Mforns: Use an accumulator to count in spark Refine [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/413689 (owner: 10Joal)
[17:01:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Use an accumulator to count in spark Refine [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/413689 (owner: 10Joal)
[17:01:13] <dsaez>	 ottomata, joal: at least using python2.7 on my virtualenv doesn't help
[17:02:25] <wikibugs>	 10Analytics, 10Cloud-VPS, 10EventBus, 10Services (watching): Set up a Cloud VPS Kafka Cluster with replicated eventbus production data - https://phabricator.wikimedia.org/T187225#4043848 (10Ottomata) Hm, possibly.  Something like 3 nodes, 2TB+ storage each, 32G+ RAM each, 12ish CPUs.  I haven't loved maint...
[17:03:03] <elukey>	 ottomata: mirror-maker graphs are a bit flat :(
[17:04:34] <joal>	 dsaez: I tried to run a simple SQL query using pyspark2, it worked
[17:04:40] <joal>	 what are you trying to do?
[17:04:54] <dsaez>	 queries work
[17:05:11] <dsaez>	 joal: I get the error running a regular expression
[17:06:37] <joal>	 dsaez: can you paste some code please?
[17:06:52] <dsaez>	 from pyspark.sql.functions import udf
[17:06:52] <dsaez>	 import re
[17:06:52] <dsaez>	 def getWikilinks(wikitext): #UDF to get wikipedia pages titles
[17:06:52] <dsaez>	     links  = re.findall("\[\[(.*?)\]\]",wikitext) #get wikilinks
[17:06:52] <dsaez>	     titles = [link.split('|')[0] for link in links] #get pages
[17:06:53] <dsaez>	     return titles
[17:06:55] <dsaez>	 udfGetWikilinks = udf(getWikilinks)
[17:07:04] <dsaez>	 df = spark.read.parquet('hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01/cawiki')
[17:07:09] <dsaez>	 df2 = df.withColumn('wikilinks',udfGetWikilinks(df.revision_text))
[17:07:13] <dsaez>	 df2.show()
[17:12:01] <joal>	 dsaez: I have your exact function working well with rdds
[17:12:11] <joal>	 I'm assuming the problem comes from udf
[17:12:34] <dsaez>	 joal: will try, I can code this with lambda function 
[17:13:02] <joal>	 dsaez: https://gist.github.com/jobar/aa576e0ba736c1ed0a4b4621187afeda
[17:15:08] <dsaez>	 joal: I get Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
[17:15:33] <dsaez>	 ups, same error
[17:15:34] <dsaez>	 Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
[17:15:41] <dsaez>	 joal: are you using any virtualenv?
[17:15:56] <joal>	 dsaez: I suggest running pyspark out of a notebook :)
[17:16:23] <dsaez>	 joal:  yes! I think that's the point! I'll try and let you know, thank you very much
[17:17:14] <joal>	 dsaez: your exact code has worked fine for me out of a workbook
[17:17:53] <dsaez>	 joal, are you using a interactive-shell or just spark-submit ?
[17:18:04] <joal>	 dsaez: interactive sheel
[17:18:09] <joal>	 s/sheel/shell
[17:18:18] <joal>	 shouldn't mnake any difference dsaez 
[17:18:41] <dsaez>	 I know... weird notebooks
[17:19:17] <dsaez>	 prety but crazy
[17:19:46] <dsaez>	 joal: now I managed to make it work with the notebook using python2.7!
[17:19:57] <joal>	 weird
[17:22:17] <ottomata>	 elukey:  (in meetign) HMMM, MM logs look pretty goood..
[17:22:32] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4043978 (10mobrovac) These jobs are not high-traffic, so consolidating the job runners and spreading the load all over them sounds lik...
[17:23:25] <ottomata>	 elukey:  going to bounce one of the brokers..
[17:23:31] <ottomata>	 sorry
[17:23:32] <ottomata>	 not brokers
[17:23:34] <ottomata>	 mm instances :p
[17:24:32] <wikibugs>	 10Analytics: pyspark2 different versions in Driver and Workers - https://phabricator.wikimedia.org/T189497#4043992 (10diego) @Ottomata and @JAllemandou  I found a work-around by creating an python2.7 virtualenv on stat1005.  I think that is the easiest solution right now.  Updating python3 on the workers might b...
[17:27:08] <elukey>	 ottomata: ok :D
[17:27:34] <elukey>	 ottomata: just added a role::eventlogging::analytics::legacy to eventlog1001, with only the forwarder
[17:27:41] <elukey>	 enabled puppet and removed all the other configs
[17:28:01] <elukey>	 so there is no risk to accidentally have both running at the same time
[17:31:01] <ottomata>	 gr8
[17:31:07] <ottomata>	 elukey:  i just bounced mm instance on 1023
[17:31:17] <ottomata>	 it started producing, and also seems to be the only one doing so
[17:31:23] <ottomata>	 even though others do 'own' partitions
[17:32:22] <ottomata>	 elukey:  i'm really not sure what the heckay is happening here.
[17:32:31] <ottomata>	 i'm going to remove the mw job  topics from mm replication to jumbo
[17:32:50] <joal>	 ok ottomata - mobile_app job fixed - sorry for the mess
[17:33:33] <elukey>	 ottomata: ack
[17:34:28] <joal>	 !log Restart mediawiki-history-reduced oozie job to add a dependency
[17:34:28] <ottomata>	 i didn't even know it was broken :p!
[17:34:29] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:35:36] <ottomata>	 or API version? i'll try that first.
[17:36:31] <ottomata>	 oh elukeyhmm
[17:36:33] <ottomata>	 actually
[17:36:33] <ottomata>	 first
[17:36:38] <ottomata>	 we should fix the webrequest api version
[17:36:41] <ottomata>	 remember?  i set that incorrectly
[17:36:46] <ottomata>	 all vks are using the wrong version
[17:36:48] <ottomata>	 right now
[17:38:11] <elukey>	 ottomata: yep but it shouldn't be a huge deal no? we should switch to no enforced version though (so enable auto-negotiation)
[17:38:19] <ottomata>	 right
[17:38:40] <ottomata>	 https://gerrit.wikimedia.org/r/#/c/418959/
[17:38:52] <ottomata>	 i'm going to manually apply on canaray and restart vk
[17:39:54] <elukey>	 should we just remove that bit since we are using jumbo only?
[17:40:00] <ottomata>	 yes
[17:40:02] <ottomata>	 well
[17:40:06] <ottomata>	 oh we should yaeh
[17:40:13] <ottomata>	 elukey:  canary is still using analytics!
[17:40:15] <ottomata>	 i thought it used text...
[17:40:46] <ottomata>	 role text
[17:40:49] <elukey>	 ahahha lol well let's just switch it to jumbo too
[17:41:11] <ottomata>	 AHH, because it uses role function on canary
[17:41:18] <ottomata>	 we need to put it in hiera canary.yaml
[17:41:42] <elukey>	 https://gerrit.wikimedia.org/r/#/c/418959/2/modules/profile/manifests/cache/kafka/eventlogging.pp - eventlogging still needs this right?
[17:42:09] <wikibugs>	 (03PS6) 10Joal: Add XmlConverter spark job [analytics/wikihadoop] - 10https://gerrit.wikimedia.org/r/361440
[17:42:14] <ottomata>	 yes
[17:42:38] <ottomata>	 OH sorry didn't mean to remove it...
[17:44:28] <ottomata>	 elukey:  looks ok on 1052
[17:44:35] <ottomata>	 (i modifeid it there too)
[17:44:39] <ottomata>	 https://gerrit.wikimedia.org/r/#/c/418959/
[17:44:52] <ottomata>	 yar
[17:44:53] <ottomata>	 still wrong
[17:45:21] <ottomata>	 thaaar we go
[17:45:50] <ottomata>	 elukey:  +1?  
[17:46:39] <elukey>	 lemme check
[17:46:55] <elukey>	 gogogog
[17:47:03] <ottomata>	 ook
[17:48:38] <wikibugs>	 (03CR) 10Cicalese: "This is ready to be reviewed/merged. Thanks!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/418922 (owner: 10Cicalese)
[18:06:30] <elukey>	 ottomata: afaics all vk instances restarted, and no issue reported/visible in our metrics
[18:06:40] <ottomata>	 great
[18:06:47] <elukey>	 did you restart all of them at once? 
[18:07:11] <elukey>	 ah no they seems a bit staggered
[18:07:15] <elukey>	 super good
[18:07:16] <ottomata>	 i ran puppet via cumin
[18:07:23] <ottomata>	 but staggered per cluster
[18:07:30] <elukey>	 yep my bad, I was just curious no concern
[18:07:58] <elukey>	 all right, I'd log off if nothing is outstanding
[18:09:55] <ottomata>	 elukey:  i'm really just trying things now, i'm wondering if local buffers are filling up too fast, and then blocking?  going to increase producer buffer.memory and see
[18:10:49] <ottomata>	 !log bouncing kafka mirrormaker for main-eqiad -> jumbo with buffer.memory=128M
[18:10:50] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:15:06] <elukey>	 (Peter is deploying restbase now, so edit-related endpoints will be rated limited to 25rps now)
[18:17:38] <ottomata>	 !log bouncing kafka mm eqiad -> jumbo witih acks=1
[18:17:40] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:23:28] <elukey>	 seems going well now right?
[18:24:19] <elukey>	 going off, will read laterz :)
[18:24:20] * elukey off!
[18:26:41] <ottomata>	 inteesting elukey yeah, but by making acks=1
[18:26:43] <ottomata>	 which is not ideal
[18:26:49] <ottomata>	 laters
[18:31:20] <ottomata>	 nope, not fine, just took longer
[18:43:47] <wikibugs>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, 10Services (doing): FY17/18 Q3 Program 8 Services Goal: Migrate two high-traffic jobs over to EventBus - https://phabricator.wikimedia.org/T183744#4044428 (10Pchelolo)
[18:43:51] <wikibugs>	 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, 10Services (done): Support claimTTL and rootClaimTTL in change-prop - https://phabricator.wikimedia.org/T189303#4044425 (10Pchelolo) 05Open>03Resolved
[19:03:59] <joal>	 heya ottomata - Have switched refine to new code?
[19:06:16] <ottomata>	 abou tto!
[19:06:24] <ottomata>	 been dealing with mirror maker
[19:07:56] <wikibugs>	 (03CR) 10Mforns: [C: 032] "Looks good! Nice use of the explode_by feature :]" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/418922 (owner: 10Cicalese)
[19:39:23] <ottomata>	 !log deployed new Refine jobs (eventlogging, eventbus, etc.) with  deduplication and geocoding and casting
[19:39:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:43:17] <wikibugs>	 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4044585 (10sahil505) @mforns - I would be interested to work on this project for GSoC 2018. I have explored some of the sub-tasks that have been listed above and have...
[19:45:59] <ottomata>	 haha dsaez, yt?
[19:46:11] <ottomata>	 you have somehow requested ALL of the memory on the cluster
[19:46:13] <ottomata>	 almost all
[19:46:21] <dsaez>	 ottomata yeah!
[19:46:21] <ottomata>	 2TB
[19:46:28] <ottomata>	 you meant to do that?
[19:46:37] <dsaez>	 I am the king
[19:46:40] <ottomata>	 hahah
[19:46:44] <dsaez>	 no, sorry, let me kill stuff
[19:47:00] <dsaez>	 better now?
[19:48:08] <ottomata>	 still says 2022400 allocated, its ok though (maybe) i'm going to make the jobs i'm trying to run production queue
[19:48:15] <ottomata>	 hopefully htat wil pre-empt some of your containers
[19:49:12] <dsaez>	 I'm joining two dataframes of 500M lines each... anyhow, seems to be too much memory ( I had another process runing but I've just killed)
[19:49:50] <dsaez>	 I'm browsing in yarn.wikimedia.org, but where exactly can you see the memory usage per user?
[19:50:46] <ottomata>	 not per user, but per job
[19:50:48] <ottomata>	 https://yarn.wikimedia.org/cluster/app/application_1520532368078_12332
[19:50:51] <ottomata>	 https://yarn.wikimedia.org/cluster/scheduler
[19:51:01] <ottomata>	 i'm looking at ^ 3rd column from right
[19:51:06] <ottomata>	 Allocated Memory MB
[19:51:19] <dsaez>	 I see
[19:51:50] <dsaez>	 should I kill that process? not sure how long it will take... as I told it is just join 
[19:52:12] <dsaez>	 maybe I can write to hive and join there, I image that would be better
[19:52:35] <ottomata>	 hm, i dunno, maybe, are you using spark dynamic allocation?
[19:52:55] <ottomata>	 or just requesting 42 workers with 48G of ram each?
[19:53:18] <joal>	 dsaez: 42Gb of ram per executor is really a lot
[19:53:35] <dsaez>	 ok, i'll kill this, sorry, I was just doing some trial and error
[19:53:55] <ottomata>	 i dunno if he actually requested that, i just see 43 exectors and ~2TB RAM used
[19:54:10] <ottomata>	 maybe 42 executors, one driver?
[19:54:13] <joal>	 dsaez: we usually look at the ratio RAM/CPU - The cluster is configured with 2x more RAM than core, leading to ~2Gb RAM per CPU core
[19:54:30] <joal>	 ottomata: indeed, 42 exsec, 1 driver, 42Gb RAM each
[19:54:47] <dsaez>	 killed.
[19:54:50] <joal>	 Thanks :)
[19:56:20] <joal>	 dsaez: I suggest you use 8Gb per executor, 2 cores per executor, and 8Gb for the driver as well. It would also be good, with those settings, to add: --conf spark.dynamicAllocation.maxExecutors=128, leading to max 1/2 the cluster used for the job
[19:56:49] <dsaez>	 sorry guys, that was wrong debuging strategy 
[19:56:58] <dsaez>	 joal: ok, sounds reasonable
[19:57:23] <ottomata>	 thank you dsaez!
[19:57:37] <dsaez>	 i was just enjoying the power of the cloud :D 
[19:58:00] <joal>	 dsaez: I love that feeling of CPUs heating :)
[19:58:40] <dsaez>	 ;)
[20:04:52] <joal>	 ottomata: how's the thing with new-Refine?
[20:12:05] <ottomata>	 running the first one now!  crons for other start in a bit
[20:12:14] <ottomata>	 its in cluster mode, so will see logs soon i guess... :)
[20:18:58] <ottomata>	 joal:  mostly good!
[20:19:08] <ottomata>	 in this run i got a cast exception for something we don't catch, but should be workable
[20:19:09] <ottomata>	 will add it
[20:19:13] <ottomata>	 (double -> long)
[20:19:34] <joal>	 ottomata: funny - Still the same schema?
[20:19:39] <ottomata>	 no different one
[20:20:08] <ottomata>	 hmm, actually
[20:20:10] <ottomata>	 it got past our checks
[20:20:59] <ottomata>	 which means hive type coercion should have worked
[20:21:04] <ottomata>	 its failling at parquet level
[20:21:17] <ottomata>	 maybe
[20:21:18] <ottomata>	  java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.Long
[20:21:18] <ottomata>	 	at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaLongObjectInspector.get(JavaLongObjectInspector.java:40)
[20:21:18] <ottomata>	 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$LongDataWriter.write(DataWritableWriter.java:398)
[20:22:26] <ottomata>	 geocoded_data column added just fine though woohooo
[20:26:50] <ottomata>	 coool, RefineMonitor should work too!
[20:28:31] <wikibugs>	 10Analytics, 10Analytics-Cluster, 10Patch-For-Review, 10User-Elukey: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464#4044688 (10Ottomata) Not sure what is happening still.    I thought that I had reduced the Batch Expired error by sett...
[20:30:45] <joal>	 ottomata: I got it - And actually that means we need to be good at casting ourselves
[20:31:06] <joal>	 ottomata: When we were reading the data using a predefined (merged) schema, casting was done at read time
[20:31:20] <wikibugs>	 10Analytics-Kanban, 10MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), 10Patch-For-Review: Record and aggregate page previews - https://phabricator.wikimedia.org/T186728#4044694 (10Ottomata) FYI, I just deployed the change that does geocoding (and deduplication) for EventLogging events in Hive....
[20:31:25] <joal>	 Now, casting is NOT done anymore (except for the specific cases already implemented
[20:31:48] <ottomata>	 ohhhhh
[20:31:55] <ottomata>	 hm
[20:32:02] <ottomata>	 but, shouldn't the HiveTypeCoercion thing handle this?
[20:32:04] <joal>	 makes sense
[20:32:16] <joal>	 ottomata: the coercion is made at type level, not value level
[20:32:43] <joal>	 So type coercion means the value should cast correctly, but it still needs to be done
[20:33:14] <ottomata>	 hmm
[20:33:35] <ottomata>	 not sure that makes sense to me, why would spark have this logic in it if it didn't use it?
[20:33:40] <ottomata>	 shouldn't htis work if you do something like
[20:34:06] <ottomata>	 sqlContext.sql("INSERT INTO ... VALUES(1.234, ...)"), where the first column is a Long?
[20:34:11] <ottomata>	 i will try!
[20:41:07] <joal>	 ottomata: thinking again about that type coercion thing
[20:41:24] <joal>	 The function we use is on types - It doesn't affect values
[20:41:44] <joal>	 However, it tells you that you should be able to safely cast from to another
[20:41:58] <joal>	 Which means casting would still need to be done
[20:42:33] <ottomata>	 joal
[20:42:38] <ottomata>	 bc?
[20:42:40] <ottomata>	 i just tested some stuff
[20:43:49] <ottomata>	 tested inserting doubles into a long hive column, it worked
[20:43:50] <joal>	 OMW
[21:41:24] <wikibugs>	 10Analytics, 10Performance-Team: Possible statsv corruption? - https://phabricator.wikimedia.org/T189530#4044794 (10Krinkle)
[22:01:48] <ottomata>	 haha joal we do not need to worry about this schema
[22:01:54] <ottomata>	 https://meta.wikimedia.org/w/index.php?title=Schema:InputDeviceDynamics
[22:02:14] <ottomata>	 "The data depends on the field name.
[22:02:30] <ottomata>	 I'm goign to blacklist this schema
[22:07:57] <wikibugs>	 (03CR) 10Cicalese: "Yes, absolutely, that was my intention - one file per MediaWiki version including multiple PHP versions per file. Thanks!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/418922 (owner: 10Cicalese)
[22:08:32] <wikibugs>	 (03CR) 10Cicalese: [V: 032] Added queries of PHP version by MediaWiki version. [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/418922 (owner: 10Cicalese)
[22:47:25] <wikibugs>	 (03CR) 10Mforns: [C: 032] "Hey joal, this code looks great! We'll have a *lot* of information about the reconstruction process. I left 2 typo comments, but +2 on my " (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) (owner: 10Joal)
[22:49:48] <CindyCicaleseWMF>	 mforns: Is there something else that will need to be done to deploy the new pingback queries, or will the .tsv files just automagically appear at some point soon? Thanks~
[22:50:11] <mforns>	 hi CindyCicaleseWMF :]
[22:50:29] <CindyCicaleseWMF>	 Hi mforns :-)
[22:51:32] <mforns>	 whenever your change is merged, and puppet deploys it to stat1006 (20 mins later) your report will start to execute
[22:51:54] <mforns>	 have you seen my comment on the review? is that OK for you?
[22:52:33] <mforns>	 oh, OK, reading your comment now
[22:52:46] <CindyCicaleseWMF>	 Excellent! I was pretty sure you had some wonderful deployment magic. Yes, I responded to your comment on gerrit. I'm glad you liked the use of explode_by ;-) And, I did realize it would generate multiple files in a subdirectory. I tested it locally before creating the patch.
[22:53:04] <mforns>	 perfect
[22:53:18] <CindyCicaleseWMF>	 Thanks so much for the review!
[22:53:38] <mforns>	 so yes, you should hopefully see report files being written soon, probably at the next full hour when reportupdater is called by cron
[22:54:02] <mforns>	 if not, please let us know
[22:54:09] <CindyCicaleseWMF>	 Cool. I'll work on the graph configuration now. Will do.
[22:54:47] <mforns>	 great :]
[23:02:37] <CindyCicaleseWMF>	 Yay! The new files are starting to appear.
[23:03:59] <mforns>	 :]