[00:04:00] Analytics, Cloud-VPS, EventBus, Services (watching): Set up a Cloud VPS Kafka Cluster with replicated eventbus production data - https://phabricator.wikimedia.org/T187225#4045202 (bd808) I don't know if we can set aside that much space on 3 different labvirts today or not, but we can check. The d...
[01:10:53] Analytics-Kanban, Discovery-Analysis, Wikipedia-Android-App-Backlog, Patch-For-Review: Bug behavior of QTree[Long] for quantileBounds - https://phabricator.wikimedia.org/T184768#4045333 (Mholloway)
[08:10:17] Analytics, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging servers to Stretch - https://phabricator.wikimedia.org/T114199#4045717 (elukey)
[08:15:55] Analytics, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging servers to Stretch - https://phabricator.wikimedia.org/T114199#4045725 (elukey) Updates: * eventlog1001 runs a special puppet role called `eventlogging::analytics::legacy` that enforces o...
[08:44:36] Analytics, Analytics-Wikimetrics, Security-Reviews: security review of Wikimetrics {dove} - https://phabricator.wikimedia.org/T76782#819706 (Bawolff) Umm, so this was filed in 2014. What is wikimetrics? Is it something that (still) needs a security review?
[08:51:19] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#4045760 (elukey) Current status of home dirs: ``` 115G ellery 241G mkroetzsch 648G ezachte 754G flemmerich 782G dsaez 950G mirrys ``` @leila, @Miriam, @diego: do you guy...
[08:59:31] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#4045771 (elukey) @ezachte thanks a lot for the massive space reduction! \o/
[09:11:57] miriam_: o/
[09:37:39] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4045827 (elukey) I acked on icinga notebook100[34] systemd unit failures to avoid confusion for other people (expected I guess since the task is WIP...
[09:48:12] Analytics, Collaboration-Team-Triage, DBA, Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#2080811 (Marostegui) @Milimetric with all the clean up work done on EL servers, is this still a valid task?
[10:01:59] Analytics, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging servers to Stretch - https://phabricator.wikimedia.org/T114199#4045889 (elukey)
[10:08:05] Analytics-Kanban, MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), Patch-For-Review: Record and aggregate page previews - https://phabricator.wikimedia.org/T186728#4045894 (phuedx) 💪 Excellent work, @Ottomata!
[10:11:09] as FYI I am adding analytics1072 to the hadoop worker nodes
[10:29:26] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4045940 (elukey) This bit prevents the first couple of puppet runs to complete (and also yarn to start etc..): ``` Error: Could not...
[10:51:07] an1073 same thing
[11:06:38] there you go, two moar nodes added
[11:23:05] Analytics, Analytics-Wikimetrics, Security-Reviews: security review of Wikimetrics {dove} - https://phabricator.wikimedia.org/T76782#4046101 (Aklapper) > What is wikimetrics? https://phabricator.wikimedia.org/project/profile/631/ links to https://www.mediawiki.org/wiki/Analytics/Wikimetrics links to...
[11:27:50] Analytics-Data-Quality, Analytics-Kanban, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review: new WMDE public data set on stat1005 - https://phabricator.wikimedia.org/T187606#4046132 (GoranSMilovanovic) Hi @mforns, the `asExtensionUpdate.csv` file - that one that needs to be made...
[11:34:19] * elukey lunch!
[11:47:54] ping joal / mforns / fdans: just making sure yall saw this https://gerrit.wikimedia.org/r/#/c/413265/ and knew that it was basically ready for review. The only part that doesn't work is the last druid step which I could use some advice from joal on, but I can figure it out myself when I'm back
[11:47:59] elukey: I've just free 740G in stat1005 ;)
[11:50:24] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#4046251 (diego) @elukey: Done ;) now 66G /home/dsaez/
[11:54:11] Analytics, Collaboration-Team-Triage, DBA, Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#4046264 (Milimetric) It looks to me like those tables still exist and there's still data on the box that analytics-slave points to, so yeah, I think they...
[12:35:45] dsaez: awesome thanksss!!!
[12:49:15] Hi lads - sorry late start today :)
[12:49:25] Thanks elukey for moar nodes :)
[12:49:45] milimetric: I'll review the thing and discuss with andrew the python thing - It's weird
[12:51:22] thanks dsaez for using spark more than stat machines :) That's super great :)
[13:09:30] o/
[13:09:33] \o
[13:09:56] joal did we just high five?
[13:10:12] ottomata: I guess so!
[13:14:59] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: Restricting access for a collaboration nearing completion - https://phabricator.wikimedia.org/T189341#4039520 (Ottomata) I had emailed Dario about this before, and told him it might be hard, but on second thought, I think it isn'...
[13:18:10] elukey: moornnnin, el -> jumbo in a few mins?
[13:18:29] ottomata: morninggggg
[13:18:54] sure, I have some workers at home making some noise so if you don't mind I'd prefer over IRC
[13:19:00] k
[13:19:02] np
[13:19:06] super thanks :)
[13:20:45] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#4046529 (Ottomata) Ah thanks, yeah, I meant to get back to this the next day but we got other thinnngnggs
[13:24:43] elukey: i have no idea why mirror maker is ok now but not before
[13:25:16] ottomata: those hosts understood that you were upset and they backed off
[13:25:22] haha
[13:25:24] i think the acks=1 change I have is still set in running mm
[13:25:28] but, that didn't actually fix it
[13:25:32] i saw the same error
[13:25:35] with acks=1
[13:26:07] puppet is still disabled
[13:26:23] the only other change is the api version change for varnishkafka
[13:26:29] we should def merge the round robin thing for mm
[13:26:32] yeah
[13:26:42] ok elukey i'll work on that later today then and finish that up
[13:26:47] maybe remove acks=1 again and see what hapepns
[13:28:15] +1
[13:28:42] ottomata: I checked logs for the new hadoop worker nodes in service, except that warning for procfs I don't see anything weird
[13:28:53] there are still 4 left to deploy
[13:29:00] I can do it over the next two days if you are ok
[13:30:27] ok yeah!
[13:30:29] let's do it!
[13:30:40] elukey, one thing we should look out for, is disparate python versions for pyspark users
[13:31:01] i'm not sure how much it matters, but diego had a little problem yesterday, i think it might have ended up being unrelated
[13:31:37] ah ok I wasn't aware of that :(
[13:31:58] ottomata, elukey : I think Dan experienced a similar issue with one a sqoop jobs
[13:32:03] hi teaaam
[13:32:20] hiii
[13:32:56] oops, wrote sqoop while I was thinking druid
[13:33:10] ottomata: was it python3 versions or 2.7 vs 3.x ?
[13:33:14] oozie druid loading jobs use #!/usr/bin/env python
[13:33:29] for diego's py spark thing, i'm not sure, both? that's why it was confusing
[13:33:33] And it seems to have failed to do so on an67
[13:33:45] he was getting an error about differing driver vs executor python versions, but it said 2.7 vs 3.5 i think
[13:33:45] that's not on stretch
[13:34:07] okok might be interesting to follow up
[13:34:48] yeah, but i think it was not related in this case? not sure. I think his notebook configuration was launching python2 in one place and python 3 in another
[13:34:59] which woudln't work if all nodes were just jessie either
[13:35:14] anyway he apparently got it to work in pyspark on the CLI
[13:35:17] so on an1067 we have 2.7.9 and 3.4.2
[13:35:21] aye
[13:35:37] on an1071 2.7.13 and 3.5.3
[13:35:56] buut it should't be a big deal
[13:36:21] did diego open a task? Just to know what error was returned to test
[13:37:17] he did https://phabricator.wikimedia.org/T189497
[13:37:40] elukey: that task lacks context, joal might know more, he was talking with diego here
[13:37:58] * elukey knows that joal always knows more :D
[13:38:26] * joal wonders how people can even think that :-P
[13:41:13] ottomata: Just did a test on the idea of using SQL generated with data casting
[13:41:26] ottomata: I took the exact failure example we had with too-big number
[13:42:08] ya?
[13:42:31] ottomata: It casts ok, but we get a weird value instead of None -- -9223372036854775616
[13:42:58] both in spark1 and spark2
[13:43:20] hmm, ok, i think i'm ok with that
[13:43:45] its not correct either way
[13:44:13] hm, IMO null is beter here than a wrong but usable value
[13:44:19] ottomata: one q about the zmq forwarder - how should we handle it?
[13:45:06] elukey: i think just flip it asap
[13:45:44] hmmmmmmmmm or, we could set it up on eventlog10002, consuming from jumbo, and then change the webperf hostname in https://gerrit.wikimedia.org/r/#/c/404773/ with all the other changes
[13:47:48] I know you'll not like it buuuut what about waiting for the perf team to merge their code change to pull from Kafka? Timo told me that it should be due today/tomorrow
[13:48:03] oh hm
[13:49:34] elukey: i guess that's fine...
[13:49:49] i dunno though, its a totally new thing
[13:50:03] what if they have to roll back? i guess we could keep the zmq thing running consuming from jumbo for a bit
[13:51:31] ottomata: can we have more than one forwarder on the same host? we could add another one to eventlog1001
[13:52:05] Analytics, Collaboration-Team-Triage, DBA, Notifications: Purge all Schema:Echo data after 90 days - https://phabricator.wikimedia.org/T128623#4046608 (mforns) The Echo schema is present in EventLogging's purging white-list, see: https://github.com/wikimedia/puppet/blob/production/modules/profile...
[13:52:10] but I have no idea how/when they'd need to flip it
[13:52:41] this is why I proposed to wait, to better coordinate with Timo and avoid "ooops"
[13:55:35] ya sure, different port though
[13:55:50] allirght, waiting it is!
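A minimal sketch of the preference voiced at 13:44 (null rather than a wrong but usable value), written in plain Python rather than Spark SQL. The name `to_int64_or_none` is hypothetical and not taken from the Refine code; it only illustrates checking a value against the signed 64-bit range before treating it as a long.

```python
from typing import Optional, Union

# Bounds of a signed 64-bit long
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def to_int64_or_none(value: Union[int, float, str, None]) -> Optional[int]:
    """Return value as an int if it fits in a signed 64-bit long, else None.

    Hypothetical helper for illustration only. Floats are truncated toward
    zero by int(); strings must be plain integer literals.
    """
    if value is None:
        return None
    try:
        as_int = int(value)
    except (TypeError, ValueError, OverflowError):
        # Unparseable strings, NaN and infinity all end up here
        return None
    if INT64_MIN <= as_int <= INT64_MAX:
        return as_int
    # Out of range: report "no value" instead of a wrapped or clamped number
    return None
```

With this rule a too-big number becomes None, so downstream consumers see a missing value instead of a plausible-looking but incorrect long.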
[13:56:01] that'll let me focus on Mirror maker / notebooks anyway i guess
[13:59:09] Analytics, Analytics-Cluster, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4046622 (Ottomata) @Krinkle @Imarlier, we were going to do this today, but thought we should...
[14:00:36] Analytics-Data-Quality, Analytics-Kanban, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review: new WMDE public data set on stat1005 - https://phabricator.wikimedia.org/T187606#4046624 (mforns) Hi @GoranSMilovanovic :] The data set is good for publishing now. Thanks for applying the...
[14:01:22] Analytics, Analytics-Cluster, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4046627 (elukey) Question number 2 is also how do you guys prefer to coordinate for the swit...
[14:12:30] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4046639 (Pchelolo) > Add a second LVS IP, to be served from the same cluster, to use for videoscaling. This will guarantee we evenly...
[14:13:58] o/ joal
[14:14:03] Hi halfak
[14:14:26] Seems like we're getting a little behind for our OpenSym efforts.
[14:14:38] elukey: for roundrobin mirror maker, i wonder if we should not touch the analytics one
[14:14:41] its working as is
[14:14:41] halfak: Man - Too many thngs opened :(
[14:14:45] and we will eventually remove it?
[14:14:54] joal, should we pull the ripcord?
[14:15:02] halfak: I do thank you a lot for the effort you put in writing that plan
[14:15:30] No problem. It's not uncommon to have a few false starts on the way to a good paper :)
[14:15:31] ottomata: I'd apply the change in there too, it should be a big deal
[14:15:38] plus it is a big win
[14:15:39] halfak: Yes - It's safer
[14:15:54] Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4046641 (diego) @JAllemandou : as we discussed on IRC, could you please add the timestamp for each revision? Also it would be good to have the data partitioned by revision_id, because this would make easer futures joins to get a...
[14:15:56] OK I'll let Stuart know that we're backing out.
[14:16:00] hmm elukey i agree, and kinda want to do it, but i might have thought you wouldn't have since you are the safer guy!
[14:16:10] 'don't touch a working system' :p
[14:16:19] especially one we will turn off in a few months
[14:16:24] I hope you don't mind me pulling him in. He's got a lot of good stuff (unpublished) to say about the role of statistics in distributed governance.
[14:16:30] halfak: 2 weeks for a paper from a non-writer person -- How pretentious was I!
[14:16:32] it would make me feel better about the processes there though
[14:16:35] I hope we can work with him again in the future
[14:16:41] joal, not at all!
[14:16:42] just knowing they were more balanced :)
[14:16:47] halfak: Indeed his comments were super good !
[14:17:00] AFAICT, 99% of papers are mostly written in the last couple weeks before a deadline.
[14:17:09] And everyone is a non-writer-person until they do it a few times :)
[14:17:26] Not to say you haven't written, but maybe not this type of paper
[14:17:27] halfak: I hear that - And writing is a habit ... I should get into that if I want to actually do it at some point :)
[14:17:53] Maybe we should blog more about the importance of analytics tools for Wikimedians :)
[14:18:18] E.g. let's interview some of the folks who are excited, interested, concerned, etc. about wikistats 2.0
[14:18:24] halfak: Many thanks again, I'll continue to think about the arguments discussed in the docs, and possibly how to organise my techincal aspects around them :)
[14:18:34] :)
[14:18:38] halfak: That would make a lot of sense
[14:18:40] :)
[14:18:56] Ooh. I wonder if we could do this around PAWS and Quarry. I'ma bug bd808
[14:19:48] halfak: in that direction, I do really hope we'll get to better ananlytics tools for labs 'soon'
[14:19:53] elukey: gonna merge it for analytics too :)
[14:21:00] joal, right on. It always helps to develop our arguments. (1) so that we can prioritize the work and (2) so we know what work is most important too!
[14:21:02] ottomata: +2 from me :)
[14:21:22] Maybe I could do some casual interviews at the hackathon.
[14:21:24] just added the RPC metrics for namenode/resource-manager on an1001/1002
[14:21:28] * halfak scratches chin
[14:21:29] will add them to grafana soon
[14:21:34] joal, gonna be at the hackathon?
[14:21:45] halfak: nope - Baby year - I stay quiet mostly
[14:22:18] halfak: I should have more travelling time next year
[14:22:23] !log bouncing mirrormaker for main -> analytics on kafka101[234] to apply roundrobin
[14:22:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:45] Boo. OK :) I'll be in Paris for OpenSym. Maybe you'll be interested in attending or I could find an opportunity to travel to you. :)
[14:23:18] halfak: If you have some time and fancy a trip in the countryside of France - Please come and visit!!!
[14:23:35] * halfak wants to bike up a proper mountain.
[14:23:49] halfak: If not feasible, I'll see if I can organise myself to be in Paris at OpenSym time
[14:24:07] halfak: I live close to the sea - No mountain if you come near me :)
[14:24:44] I guess I can cope ;)
[14:24:56] :)
[14:25:21] halfak: Let's organise to meet at hat time - Sounds great :)
[14:25:33] \o/ sounds good.
[14:28:19] Thanks again a lot halfak :)
[14:29:21] halfak: proper mountain is just https://en.wikipedia.org/wiki/Andes or Everest
[14:30:28] again
[14:30:31] halfak: proper mountain is just https://en.wikipedia.org/wiki/Andes or Everest :P
[14:32:06] !log bouncing MirrorMaker on kafka1023 (main -> jumbo) to re-apply acks=all
[14:32:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:32:24] ha. I live in glacier land. To me a proper mountain is anything steeper than a big pancake. ;)
[14:46:49] ottomata: the first puppet run on the new hadoop worker nodes seems problematic. I need to run manually apt-get update to pick up the cdh repos, and /etc/hadoop is not created before /etc/hadoop/jmx_exporter...
[14:47:06] ?
[14:47:12] i think maybe i had the apt-get update problem
[14:47:17] but not any others
[14:47:21] elukey: this looks worrysome to me
[14:47:21] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?from=now-1h&to=now&orgId=1&var-instance=main-eqiad_to_analytics&refresh=30s
[14:47:33] after bouncing analytics MMs, the consume/produce rate is half?
[14:47:38] but, lag is not increasing?
[14:47:57] dunno what is up with those '6 laggiest partitions', they seem static, and there are simliar ones for jumbo
[14:49:56] so the laggiest partitions were the same also before the switch no?
[14:50:22] kafka1014 is not doing much though
[14:50:53] so now with two consumers the rate is lower?
[14:51:12] a-Team: I'm gonna miss standup because of US time change - I need to get Lino from the creche at that time
[14:51:12] 3 consumers, kafka1014 has topics assigned, but just low volume ones i think
[14:51:16] sending e-scrum
[14:51:19] yeah, laggiest partitions are the same
[14:51:20] and
[14:51:23] it is def replicating
[14:51:35] i get messages for e.g. refreshLinks topic
[14:52:26] ottomata: I have an interesting side effect with SQL-converter
[14:52:35] joal: oh ya?
[14:53:39] ottomata: For structs defined in both schema, if an original row has this struct NULL, it'll end up with a struct populated with NULLS in the converted DF
[14:53:44] Nothing I can do about that
[14:53:48] ottomata: --^
[14:54:22] joal: you mean
[14:54:30] orig: structfield: NULL
[14:54:30] vs
[14:54:44] converted: structfield { f1: NULL, f2: NULL}, ?
[14:55:06] correct ottomata
[14:55:10] that seems fine
[14:55:19] that's what gets inserted anyway
[14:55:46] !log bouncing MirrorMaker on kafka1022 to re-apply acks=all (main -> jumbo)
[14:55:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:55:48] Since I have explicit field-retrieval with SQL, it explodes the struct
[14:56:22] The other side works as well: For a new field being a struct, it'll be populated as NULL and not STRUCT(NULL ... )
[14:56:30] ottomata: Details
[14:58:11] ohhh elukey i know why it shows laggies partitions
[14:58:18] they are reported per mm instance in the name
[14:58:37] and when the instance was rebalanced, the metric with that key stopped being added
[14:58:46] and grafana is graphing missing data a no change
[14:58:56] yes yes they are more but only because of more metrics, the rate seems the same
[14:59:05] will change it to null as 0
[14:59:23] the thing that I don't get is the throughput
[14:59:57] how is it possible that three consumers are performing less than one?
[15:07:36] !log bouncing MirrorMaker on kafka1020 (main -> jumbo) to re-apply acks=all
[15:07:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:12:47] Analytics, Analytics-Cluster, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4046777 (Imarlier) @Ottomata - should be today, in testing over the weekend I found an issue...
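An illustrative sketch of the struct side effect described at 14:53 to 14:55, using plain Python dicts in place of Spark rows. `convert_struct` is a hypothetical stand-in for the SQL-based converter, not its actual code: because every sub-field is selected explicitly, a NULL struct comes out as a struct whose fields are all NULL.

```python
from typing import Any, Dict, List, Optional


def convert_struct(struct: Optional[Dict[str, Any]], fields: List[str]) -> Dict[str, Any]:
    """Rebuild a struct by retrieving each target field explicitly.

    Hypothetical helper for illustration. A missing (None) struct has no
    fields to read from, so every explicitly selected field becomes None.
    """
    source = struct if struct is not None else {}
    return {name: source.get(name) for name in fields}


original = {"structfield": None}
converted = {"structfield": convert_struct(original["structfield"], ["f1", "f2"])}
print(converted)
```

The printed row mirrors the `structfield { f1: NULL, f2: NULL }` shape quoted in the chat, which is the form judged acceptable there.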
[15:13:27] !log bounce main -> analytics mirror maker instances
[15:13:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:20:25] a-team are we having retro today?
[15:21:02] I don't know, ottomata seemed really interested in having retro!
[15:24:33] haha
[15:24:37] nawww no way, -2 folks
[15:24:45] BOSS MAN SAYS NO FEELINGS
[15:43:58] Analytics-Kanban, User-Elukey: Reduction of stat1005's disk space usage - https://phabricator.wikimedia.org/T186776#4046952 (leila) @flemmerich Please check this thread.
[15:44:15] xD
[15:54:33] * elukey coffee
[15:58:08] Analytics, Analytics-Cluster, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Move EventLogging analytics processes to Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T183297#4047029 (Ottomata) > it knows how to catch up Sooo, this is a funky moment. We are doing s...
[16:01:12] npm for the win ! https://pbs.twimg.com/media/DYBuu5WX4AA2sDa.jpg:large
[16:07:49] Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4047059 (JAllemandou) @diego : PR has been submitted (https://gerrit.wikimedia.org/r/#/c/361440/) Can you explain more about how you'd partition the data here? Given that revision_id is a unique row identifier (or almost, some n...
[16:08:10] (PS7) Joal: Add XmlConverter spark job [analytics/wikihadoop] - https://gerrit.wikimedia.org/r/361440 (https://phabricator.wikimedia.org/T186559)
[16:13:35] (PS1) Joal: Update Refine DataFrame converter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/419217
[16:30:13] people as fyi gerrit is temp down
[16:54:08] joal: o/
[16:54:12] Hi elukey
[16:54:13] I just discovered dfs.datanode.failed.volumes.tolerated
[16:54:42] now one thing that puzzled me in the past was that one disk failure caused the datanode to go kaput
[16:54:51] but it seems that it is only the default behavior
[16:54:58] because dfs.datanode.failed.volumes.tolerated = 0
[16:55:34] so in our case, a disk failure would simply cause some blocks to get replicated elsewhere
[16:55:42] and the datanode would keep going
[16:55:52] (with dfs.datanode.failed.volumes.tolerated=1 say)
[16:55:57] am I crazy?
[16:57:05] (this is also something that I've heard at the conference but the config param wasn't written)
[16:57:17] elukey: I actually have not seen that param
[16:57:21] Seems super interesting :)
[16:57:32] "The number of volumes that are allowed to fail before a datanode stops offering service. By default any volume failure will cause a datanode to shutdown."
[16:58:31] we could even thing about dfs.datanode.failed.volumes.tolerated=2 or 3
[16:58:38] since we have 12 disks for each node
[17:00:07] elukey: makes sense
[17:06:37] Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4047263 (diego) @JAllemandou : My understanding is that if you partition by a unique id, you sort by that key,and then all the contiguous ids are in the same partition, as explained here: https://hackernoon.com/managing-en,spark-...
[17:18:23] Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4047311 (JAllemandou) @diego: Spark-partitionning happens in Spark. There is no such thing as to "spark-partition" a dataset outside of spark. Let's discuss this in IRC or hangout, should be easier.
[17:31:11] hey ottomata - patch pushed for you
[17:31:41] saw joal looks goood
[17:31:44] will look more in amin
[17:36:46] ottomata: will add some tests, particularly for nested structures
[17:43:41] Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#4047393 (diego) @JAllemandou : just for the record, in this case I meant the parquet partitions. See you in IRC
[17:51:39] (CR) Ottomata: Update Refine DataFrame converter (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/419217 (owner: Joal)
[18:03:45] Analytics, EventBus, Services (doing), User-Elukey: Kafka sometimes misses to rebalance topics properly - https://phabricator.wikimedia.org/T179684#4047477 (elukey) The example that Marko mentioned for the RecordLintJob seems to have started around 2018-03-02 (as outlined below) and completely re...
[18:04:04] ottomata: what do you think about adding dfs.datanode.failed.volumes.tolerated to the hdfs settings?
[18:05:48] I'll send a code review tomorrow in case
[18:05:54] oh yeah +1 elukey
[18:05:56] that sounds good
[18:06:01] no reason to fail whole thing if a single disk fails
[18:06:12] yep exactly, maybe 2/3 disks ?
[18:06:12] don't even know what to put that at
[18:06:13] 3?
[18:06:14] yeah
[18:06:15] 2?
[18:06:18] my thought too
[18:06:21] super :)
[18:06:26] all right logging off for today!
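A small sketch of the rule quoted at 16:57, assuming a DataNode keeps serving as long as the number of failed volumes does not exceed `dfs.datanode.failed.volumes.tolerated`. The helper name `datanode_keeps_serving` is hypothetical; the 12 disks per node and the candidate values 2 and 3 come from the discussion above.

```python
def datanode_keeps_serving(failed_volumes: int, tolerated: int = 0) -> bool:
    """True if the DataNode would stay in service under the quoted rule.

    Hypothetical model for illustration: with the default tolerated=0,
    any single volume failure takes the DataNode out of service.
    """
    if failed_volumes < 0 or tolerated < 0:
        raise ValueError("counts must be non-negative")
    return failed_volumes <= tolerated


DISKS_PER_NODE = 12  # figure mentioned in the chat

for tolerated in (0, 2, 3):
    survivable = [
        n for n in range(DISKS_PER_NODE + 1)
        if datanode_keeps_serving(n, tolerated)
    ]
    print(f"tolerated={tolerated}: stays up with {survivable} failed disks")
```

Under this model, moving from 0 to 2 or 3 means a node with one dead disk keeps serving its remaining blocks while HDFS re-replicates the lost ones elsewhere.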
[18:06:36] have a good afternoon/evening people :)
[18:14:19] Analytics, Analytics-Cluster: Alert for Kafka MirrorMaker lag - https://phabricator.wikimedia.org/T189611#4047502 (Ottomata) p:Triage>Normal
[18:14:48] Analytics, Analytics-Cluster: Alert for Kafka MirrorMaker lag - https://phabricator.wikimedia.org/T189611#4047493 (Ottomata)
[18:14:53] Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Fix Mirror Maker erratic behavior when replicating from main-eqiad to jumbo - https://phabricator.wikimedia.org/T189464#4047507 (Ottomata)
[18:49:34] holaaaaa
[18:49:38] back online
[18:49:41] wowow
[18:53:25] ottomata 's reigh is over muahahaha
[18:53:30] reign?
[18:54:58] haha
[18:59:13] (CR) Nuria: "Thanks to @mforns for getting this CRed" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/418922 (owner: Cicalese)
[19:01:26] (PS24) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[19:15:42] madhuvishy: yt?
[19:15:53] i'm trying to set up jupyterhub stuff on new notebook nodes
[19:15:58] having some trouble with this custom proxy stuff
[19:16:03] ottomata: hey
[19:16:07] don't understand how nginx is involved
[19:16:33] also https://github.com/yuvipanda/jupyterhub-nginx-chp says it is deprectaed
[19:16:54] and the requirements specifies jupyterhub=1.0.0, which is a meta package
[19:17:10] so when i rebuilt wheels for stretch, it updated some jupyterhub deps which are no longer compatible with that i think
[19:17:32] yeah yuvi built it
[19:17:32] jupyter==1.0.0
[19:17:47] do you know anything about this nchp and if I need it anymore?
[19:17:49] and now there's a replacement
[19:17:49] i don't really know what it does
[19:17:51] yeah
[19:18:24] seems like maybe its not necessary since people have to tunnel in anyway?
[19:18:27] http://jupyterhub.readthedocs.io/en/latest/index.html has some intro into the proxy and hub architecture
[19:18:47] dsaez: Heya
[19:18:50] dsaez: you still here?
[19:19:17] huh ok...
[19:19:59] but still madhuvishy don't fully understand
[19:20:12] why does the jupyterhub module require_package('nginx-extras')
[19:20:40] it doesn't use it
[19:21:07] there is this template https://github.com/yuvipanda/jupyterhub-nginx-chp/blob/master/nchp/templates/nginx.conf
[19:21:45] but afaict that isn't referenced anywhere by either nginx confs or by nchp confs
[19:23:56] ottomata: i'm unsure right now, it's been over a year and a half since I built it
[19:24:14] aye
[19:24:16] ok
[19:24:21] erfffff
[19:27:00] joal: is druid updated with new wikistats data?
[19:27:04] cc fdans
[19:27:06] yes m'am
[19:28:35] fdans: see dashboard all widgets mention "january" i think we have more than 1 problem with dates
[19:28:43] cc mforns
[19:29:11] nuria_: Feb for me on most widgets
[19:29:35] joal: ya, we have all dates create skewing all widgets
[19:29:45] ottomata: I'm a little buried in goal work to get the dumps migration done now. I was gonna ping you myself for that since I need to get the stat dataset mounts moved over
[19:30:03] ooo ya ok
[19:30:05] fdans: i will work on it, problem is UX fail
[19:30:07] sounds so fuuuun! :)
[19:30:13] for me also february for all in the dashboard except for pageviews by country
[19:30:19] nuria_: It actually makes me realise the Pageviews By Country widget has incoherent dates in regard to others
[19:30:19] madhuvishy: how can I help?
[19:30:34] ottomata: I was trying to test it though on stat1006 and not having much success, are there any holes poked to enable the nfs mount?
[19:30:35] (also, just curious is CloudServices responsible for dumps hw?)
[19:30:43] madhuvishy: yes [19:30:45] for dumps distribution alone [19:30:49] ah hm [19:30:56] yes there is a analytics vlan firewall [19:30:59] fdans: ping [19:31:03] no connections from analytics -> other prod networks allowed [19:31:06] unless a hole is poked [19:31:07] elukey: ottomata: Heads up that the coal change is unlikely to happen today. There appears to be an issue with the confluent-kafka python module (or underlying librdkafka), which prevents the process from exiting when it gets SIGINT/SIGTERM. Since that's bad news for operability, I'm looking in to switching off of that module. [19:31:09] nfs (for cloud vps and stat), web and mirrors [19:31:15] ottomata: right, thought so [19:31:15] hi analytical folks [19:31:19] I'll make a patch then [19:31:28] you gotta file a ticket and ask a network opsen, alex can help, maybe elukey even if he feels confident :) [19:31:44] could anyone point me to where someone might read EventLogging stats from the beta cluster? [19:32:10] ottomata: ah I see, okay [19:33:19] nuria_: I think we have conflicting approaches as to how to get last month [19:33:23] hmm, marlier that sounds interesting and strange, although we haven't really used confluent kafka for consuming i think [19:33:31] we only use it for producing afaict [19:33:39] so maybe we've never run into that problem [19:33:49] fdans: you do not know the month until you retrieve the data right? [19:33:54] fdans: data could be delayed [19:34:02] ejegg: what do you mean by 'stats', like the actual events? [19:34:29] ottomata: yes, whatever comes out the back end! [19:34:33] ejegg: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster [19:34:40] thanks! [19:35:03] fdans: so there is only 1 valid approach: "get the month you are showing from the data request you just did" [19:35:10] fdans: right? [19:35:32] ottomata: AFAIK it's specific to the consumer. 
Apparently the poll() method can be uninterruptable in some cases, and even if there's a timeout set, it sometimes doesn't actually do so. [19:35:50] Not much I can do about it, so I'm just going to switch libraries /shrug [19:38:02] hm aye [19:38:13] marlier: we mostly use python-kafka (kafka-python) for consuming atm [19:38:16] and we maintain debs for it [19:38:25] it is pretty up to date atm [19:38:41] i betcha it won't be hard to switch [19:38:59] Yeah [19:39:17] It's got a fugly interface, but I can make do [19:40:34] joal:not really, let's talk tomorrow please [19:40:42] dsaez: ok no prob [19:40:49] dsaez: Have a good evening :) [19:44:38] 10Quarry, 10Patch-For-Review: Quarry should refuse to save results that are way too large - https://phabricator.wikimedia.org/T188564#4012234 (10Halfak) I regularly run queries for ORES that return results in the range of 200k. These results are usually just a single numerical column and thus (probably) do no... [19:44:38] nuria_: yeah I guess the range in the model should be readjusted to the data returned by the api [19:45:01] nuria_: but without that change causing a refresh of the data [19:45:09] fdans: right, cause you do not know what data you are going to get, so yes, there are a few bugs there [19:52:21] 10Quarry, 10Patch-For-Review: Quarry should refuse to save results that are way too large - https://phabricator.wikimedia.org/T188564#4047767 (10zhuyifei1999) 05Resolved>03Open [19:56:24] (03CR) 10Nuria: "Thank you, did you tried to run the job with dynamic allocation to make sure it runs?" 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/418925 (https://phabricator.wikimedia.org/T184768) (owner: 10Joal) [19:56:29] 10Quarry, 10Patch-For-Review: Quarry should refuse to save results that are way too large - https://phabricator.wikimedia.org/T188564#4047778 (10zhuyifei1999) (Alternative: time limit the save to 60 seconds, kill it with a signal handler for SIGALRM) [20:03:00] 10Analytics, 10EventBus, 10Services (doing), 10User-Elukey: Kafka sometimes misses to rebalance topics properly - https://phabricator.wikimedia.org/T179684#4047789 (10Pchelolo) > This time the generation was 295 -> 301, spanning over 2h. How did it recover? ChangeProp restart? Yes, @mobrovac restarted it... [20:08:42] (03CR) 10Joal: "I did not test it before, but a few runs have already been executed (he 7 days from the bundle always has some reruns for the monthly one " [analytics/refinery] - 10https://gerrit.wikimedia.org/r/418925 (https://phabricator.wikimedia.org/T184768) (owner: 10Joal) [20:09:38] joal: yt? [20:09:43] Hi nuria_ [20:10:32] joal: i do not understand issues with perf in aqs with the " daily-pages-top-by-edits " [20:10:44] joal: was that several queries? [20:10:47] 10Analytics, 10EventBus, 10Services (blocked), 10User-Elukey: Investigate group.initial.rebalance.delay.ms Kafka setting - https://phabricator.wikimedia.org/T189618#4047817 (10Pchelolo) p:05Triage>03Normal [20:10:55] joal: do we have a top edited metric? [20:11:02] madhuvishy:  do you know why yuvi had to write his own built in nginx proxy thing, instead of using the nodejs one? [20:11:15] joal: we do not right? [20:11:21] nuria_: of course we do :) [20:12:07] joal: ok, but it is not in the ui, is it? 
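Editor's note on the [19:56:29] suggestion for T188564 (time-limit the save to 60 seconds and abort it from a SIGALRM handler): the idea can be sketched roughly as below. This is only an illustration, not the Quarry patch itself; `SaveTimeout`, `time_limit` and `save_with_limit` are hypothetical names, and the approach assumes a Unix host with the save running in the main thread, since that is where Python delivers signals.

```python
import signal
from contextlib import contextmanager


class SaveTimeout(Exception):
    """Raised when the guarded block runs longer than allowed (hypothetical name)."""


@contextmanager
def time_limit(seconds):
    """Abort the enclosed block with SaveTimeout after `seconds` (Unix, main thread only)."""
    def _on_alarm(signum, frame):
        raise SaveTimeout("save exceeded %d seconds" % seconds)

    # Install our handler and remember the old one so it can be restored.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # ask the kernel to send SIGALRM after `seconds`
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)


def save_with_limit(save, seconds=60):
    """Run `save()`; return True if it finished in time, False if it was cut off."""
    try:
        with time_limit(seconds):
            save()
        return True
    except SaveTimeout:
        return False
```

Compared with a fixed row-count cap, a time limit lets narrow results with many rows through (the 200k single-column case mentioned at [19:44:38]) while still stopping saves that take too long.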
[20:12:11] joal: me lost [20:12:12] not yet in the ui [20:12:20] I've just double checked [20:12:56] nuria_: I actually don't know why - probably because of lack of time to make it happen (+ table for tops was not implemented at the beginning) [20:13:09] joal: I see, "top-by-edits" [20:13:25] Ah, and also because we return page_id, not page_title, therefore making it more difficult to display nicely [20:13:31] nuria_: --^ [20:13:42] I recall that last bit is the one [20:14:12] 10Analytics: Add wikistats metric about "pagecounts" - https://phabricator.wikimedia.org/T189619#4047832 (10Nuria) [20:15:17] 10Analytics: Add wikistats metric "top-by-edits" - https://phabricator.wikimedia.org/T189620#4047844 (10Nuria) [20:16:04] nuria_: makes more sense? [20:16:08] 10Analytics: Add wikistats metric "top-by-edits" - https://phabricator.wikimedia.org/T189620#4047858 (10Nuria) Note this returns page_id, not page_title, therefore making it more difficult to display nicely [20:16:20] joal: ok, now i understand [20:16:24] cool :) [20:19:15] 10Analytics, 10EventBus, 10Services (next), 10User-Elukey: Enable controlled debug logging for change-prop - https://phabricator.wikimedia.org/T189621#4047864 (10Pchelolo) p:05Triage>03Normal [20:19:48] ottomata: are you here? [20:20:00] ottomata: I have cleaned a bit the SQL-caster thing [20:20:23] Updating the CR now ottomata [20:21:16] ya [20:22:36] (03PS2) 10Joal: Update Refine DataFrame converter [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/419217 [20:23:44] 10Analytics: AQS edits API should not allow queries without time bounds - https://phabricator.wikimedia.org/T189623#4047895 (10Nuria) [20:26:07] Oh nuria_ by the way - I've learnt a new expression today (thanks halfak) - We'll pull the ripcord for the OpenSym article [20:26:18] "ripcord" [20:26:22] joal: ahem... [20:26:25] nuria_: While the idea is cool and all, I've not invested enough time for this to work [20:27:10] joal: what idea, wait?
[20:28:42] nuria_: a paper about AQS - I talked a bit with halfak when starting and sent him the plan - He helped setting up an argument for "Why analytics numbers are important for communities" [20:29:09] nuria_: Following down that road, maybe one day we'll write it :) [20:30:30] nuria_: makes sense? [20:31:01] joal: ah ya, but are you still writing the one for the upcoming conference? [20:31:19] That was the one nuria_ - Deadline is in two days [20:31:30] And I didn't make enough progress [20:34:48] ottomata: saw ping on services, responding here - i think at that time the node one was very buggy and Yuvi hated it [20:34:59] joal: deadline for ABSTRACT though right? [20:36:07] joal: not for the FULL paper correct? [20:36:32] nuria_: I don't think so (but I proof read to be fully sure) http://www.opensym.org/2018/02/28/opensym-2018-call-for-papers/ [20:36:57] nuria_: I think it's full paper [20:37:06] That's why it's too close :) [20:38:59] joal: ah i see, indeed it is a very academic conference that seems to call for the whole paper [20:39:47] 10Analytics: AQS edits API should not allow queries without time bounds - https://phabricator.wikimedia.org/T189623#4047942 (10Reedy) [20:40:23] nuria_: I might be willing to try a more tech one first :) [20:47:37] !log restarting eventlogging processors to pick up VirtualPageView blacklist from eventlogging-valid-mixed topic [20:47:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:49:04] (03CR) 10Joal: Add by-wiki stats to MediawikiHistory job (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) (owner: 10Joal) [20:50:52] (03PS25) 10Joal: Upgrade scala to 2.11.7 and Spark to 2.2.1 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/348207 [20:50:54] (03PS8) 10Joal: Add by-wiki stats to MediawikiHistory job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255
(https://phabricator.wikimedia.org/T155507) [21:12:16] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4048036 (10mforns) [21:13:38] (03PS1) 10Zhuyifei1999: [WIP] Revert "worker: refuse to save if rowcount is > 65536" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/419273 [21:14:14] 10Analytics: Check wikistats numbers for aggregations for all wikipedias - https://phabricator.wikimedia.org/T189626#4048040 (10Nuria) [21:15:09] (03Abandoned) 10Zhuyifei1999: [WIP] Revert "worker: refuse to save if rowcount is > 65536" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/419273 (owner: 10Zhuyifei1999) [21:18:23] 10Analytics: Check wikistats numbers for aggregations for all wikipedias - https://phabricator.wikimedia.org/T189626#4048053 (10Nuria) [21:19:19] 10Analytics-Kanban, 10Google-Summer-of-Code (2018): [Analytics] Improvements to Wikistats2 front-end - https://phabricator.wikimedia.org/T189210#4048055 (10mforns) Hi @sahil505, thank you very much for applying for the project! I've added a short project description to this task, and put some links of interest... [21:26:46] Gone for tonight a-team - See you tomorrow [21:26:54] byeeeeee :] [21:27:10] By the way - Thanks for your review on stats mforns :) [21:27:39] joal, that code looks really good! looking forward to seeing those stats :] [21:47:17] (03CR) 10Mforns: [V: 031 C: 031] "LGTM! +1 only, because I see Nuria wanted to add some more comment?" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/417476 (https://phabricator.wikimedia.org/T189266) (owner: 10Fdans) [21:48:07] nuria_, I reviewed fdans's patch https://gerrit.wikimedia.org/r/#/c/417476/ and only +1 because I saw a comment of yours from Mar 9th saying something was missing?
[21:49:35] (03PS1) 10Zhuyifei1999: worker: change to SIGALRM-based limit instead of row [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/419317 (https://phabricator.wikimedia.org/T188564) [22:05:42] (03CR) 10Mforns: [C: 032] "LGTM! Feel free to merge on my side!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/415255 (https://phabricator.wikimedia.org/T155507) (owner: 10Joal) [22:11:58] mforns: ya, actually quite a bit is missing [22:12:21] mforns: for example any single new Date() we do is actually setting date to user's tz [22:13:08] mforns: also last month is calculated from actual date rather than from data coming back from API, thus if data is delayed [22:13:39] mforns: like just happened, the "label" on the data is wrong, it said February when data had not yet arrived [22:16:12] nuria_, I see [22:17:18] mforns: will work on that myself as time allows i think is going to be a regut [22:18:08] (03CR) 10Nuria: "Do you have a follow up patch deployed to wikistats-canary?"
[analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/416999 (owner: 10Fdans) [22:18:32] nuria_, OK [22:22:35] 10Analytics, 10ChangeProp, 10EventBus, 10Services (next), 10User-Elukey: Enable controlled debug logging for change-prop - https://phabricator.wikimedia.org/T189621#4048185 (10mobrovac) [22:24:06] 10Analytics, 10EventBus, 10Services (blocked), 10User-Elukey: Investigate group.initial.rebalance.delay.ms Kafka setting - https://phabricator.wikimedia.org/T189618#4048200 (10mobrovac) [22:24:09] 10Analytics, 10EventBus, 10Services (doing), 10User-Elukey: Kafka sometimes misses to rebalance topics properly - https://phabricator.wikimedia.org/T179684#4048201 (10mobrovac) [22:24:12] 10Analytics, 10ChangeProp, 10EventBus, 10Services (next), 10User-Elukey: Enable controlled debug logging for change-prop - https://phabricator.wikimedia.org/T189621#4047864 (10mobrovac) [22:32:03] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create an LVS endpoint for jobrunners on videoscalers - https://phabricator.wikimedia.org/T188947#4048217 (10mobrovac) >>! In T188947#4046639, @Pchelolo wrote: >> Add a second LVS IP, to be served from the same cluster, to use for v... [22:58:59] YouTube CEO announced YouTube is going to start linking to Wikipedia articles on hoaxes and conspiracies on videos about those things https://www.wired.com/story/youtube-will-link-directly-to-wikipedia-to-fight-conspiracies/ I wonder if we should add YT as a referrer class when we refine requests [22:59:13] bearloga: WOW [23:00:39] bearloga: human curation of garbage might be in order too (for YT not us), actually super interesting problem [23:22:28] 10Analytics, 10Analytics-Wikistats: The alert message about adblocker is not fully shown on smaller screens - https://phabricator.wikimedia.org/T188208#4048353 (10sahil505) There are many ways to go about solving this bug. One is to provide a uniform left and right margin and adjust the class height a bit. Loo... 
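Editor's note on the [22:12:21] to [22:13:39] messages (labels computed from the local clock instead of from the API response): a minimal sketch of the "take the month from the data you just fetched" approach nuria_ describes. It is written in Python for illustration only; Wikistats itself is a JavaScript front end. The item shape, with a `timestamp` field formatted as `YYYYMMDDHH`, and the function name `latest_month_in_data` are assumptions made for this example.

```python
from datetime import datetime, timezone


def latest_month_in_data(items):
    """Return (year, month) of the newest item in the response, or None if empty.

    Assumes each item carries a "timestamp" string such as "2018010100"
    (YYYYMMDDHH). The value is interpreted as UTC so the result does not
    depend on the viewer's timezone.
    """
    if not items:
        return None
    newest = max(
        datetime.strptime(item["timestamp"], "%Y%m%d%H").replace(tzinfo=timezone.utc)
        for item in items
    )
    return newest.year, newest.month


# Hypothetical response in which February has not arrived yet:
# the label should say January no matter what today's date is.
response_items = [
    {"timestamp": "2017120100", "value": 10},
    {"timestamp": "2018010100", "value": 12},
]
label = latest_month_in_data(response_items)
```

Deriving the label from the response means a delayed month simply does not show up, instead of being announced with no data behind it.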
[23:53:46] 10Analytics, 10Analytics-EventLogging, 10Readers-Web-Backlog: Should it be possible for a schema to override DNT in exceptional circumstances? - https://phabricator.wikimedia.org/T187277#4048445 (10Jdlrobson)