[03:09:20] PROBLEM - Hadoop DataNode on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[07:50:28] 10Analytics-Tech-community-metrics, 10Gerrit, 10Upstream: Gerrit patchset 99101 cannot be accessed: "500 Internal server error" - https://phabricator.wikimedia.org/T161206#3124611 (10TerraCodes) >>! In T161206#3942614, @Paladox wrote: > This is not resolved in 2.14. And most likly won't as it probaly hit a b...
[08:14:57] Hi elukey - I'm gently waking up - Let me know when you want us to start
[08:25:14] hello joal!
[08:27:05] checking an1057
[08:28:21] Operation category WRITE is not supported in state standby.
[08:28:23] elukey: Thanks for the night watch
[08:28:48] :)
[08:29:54] I checked the pager before going to bed and two node managers down at the same time smelled like something bad happening
[08:30:30] elukey: right
[08:30:32] 1057 registered a ton of "INFO org.apache.hadoop.hdfs.server.datanode.DataNode: reportBadBlock encountered RemoteException for block"
[08:30:41] weird
[08:31:01] RECOVERY - Hadoop DataNode on analytics1057 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[08:31:29] something is not right
[08:31:30] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&from=now-2d&to=now
[08:31:38] pending replication blocks is too high
[08:32:40] elukey: See the trend in under-replicated blocks - We'll probably have to wait
[08:34:01] PROBLEM - Hadoop DataNode on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[08:35:10] it might be analytics1057 not coming up, it is the only thing not working afaics
[08:35:18] elukey: I think it is
[08:35:35] elukey: It'll probably resolve faster when an1057 gets back up
[08:36:23] the problem is that it doesn't come up :D
[08:37:08] :(
[08:37:09] ahhh some hard disk failed
[08:37:20] one precisely
[08:37:26] okok now I get what's happening
[08:37:42] [2834806.964059] sd 0:2:4:0: [sdg] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[08:37:45] [2834806.964070] sd 0:2:4:0: [sdg] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
[08:42:25] crap
[08:42:39] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966288 (10elukey)
[08:45:50] there you go --^
[08:46:02] I systemctl masked all the hadoop units on an1057
[08:46:58] elukey --explain ?
[08:47:06] hahaha sorry
[08:47:09] :)
[08:49:02] so systemctl mask is a super awesome command that basically points the systemd unit for a daemon at /dev/null
[08:49:14] so any attempt to start/restart/etc.. cannot happen
[08:49:31] it is a smart blacklist
[08:49:47] and I don't need to disable puppet
[08:49:55] Oooooh ! So, when we'll work on every node, an1057 will get the updates, but will not try to start !
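
A minimal sketch of the masking workflow described above; the unit names are illustrative of the Hadoop worker daemons, not an exact list:

  # Point the units at /dev/null so nothing (including puppet) can start them
  sudo systemctl mask hadoop-hdfs-datanode hadoop-yarn-nodemanager
  systemctl status hadoop-hdfs-datanode    # reports "Loaded: masked"

  # After the disk swap, undo the mask and bring the daemons back
  sudo systemctl unmask hadoop-hdfs-datanode hadoop-yarn-nodemanager
  sudo systemctl start hadoop-hdfs-datanode hadoop-yarn-nodemanager
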
[08:50:17] exactly
[08:50:44] now I don't love the idea of having pending replication blocks
[08:50:53] elukey: Let's wait
[08:50:55] +;2~1
[08:52:14] elukey: It'll take approximately 7 hours to complete from now
[08:53:49] joal: I am wondering if it will ever return to zero, since an1057 is down now and it will stay in that way until we get a new disk
[08:54:54] elukey: it will for sure - blocks in an1057 are currently under-replicated (there are 3 instances of each of them, with one on 1057) - Now HDFS is using the 2 that are left to put another copy somewhere else
[08:55:25] joal: ah ok so HDFS does that automatically?
[08:55:55] okok makes sense then
[08:56:04] elukey: From the trend I see in "HDFS under replicated blocks" chart, I'd say yes
[08:56:23] elukey: I think HDFS is configured to wait for some long time (a few hours?) before starting to do so
[08:56:30] elukey: But we should triple check
[08:57:43] Wow elukey - I hadn't realized HDFS was still so full :(
[08:58:01] elukey: looks like most machines are 90% full
[09:02:18] yeah :(
[09:16:06] PROBLEM - Host analytics1062 is DOWN: PING CRITICAL - Packet loss = 100%
[09:16:09] lol?
[09:16:12] crap
[09:16:17] crap crap crap
[09:16:28] elukey: I have a bad feeling about HDFS
[09:16:54] elukey: too full - Too much disk usage on our machines - It might get problematic if nodes start collapsing
[09:17:33] joal: we still have a reasonable amount of space and it is not an immediate threat right?
[09:17:57] an1062 is not reachable via ssh
[09:18:01] elukey: every time a node fails, we lose 1/40th roughly of space
[09:18:11] And we are 90% full
[09:20:18] yep yep but we are reasonably ok for the moment
[09:20:31] so an1062 is completely frozen from the serial console
[09:20:32] elukey: are you on the analytics1062 mgmt? it says Serial Device 2 currently in use for me
[09:20:38] ah. ok :-)
[09:20:41] moritzm: I am yes :D
[09:21:00] elukey: here is my intuition -- http://perdstontemps.ca/wp-content/uploads/2013/07/dominoes-of-dominoes-falling-9893.gif
[09:21:23] moritzm: the hadoop cluster doesn't want java 8
[09:21:37] yeah :-)
[09:21:53] joal: thanks for being an optimist :D
[09:22:03] sorrrrrrrry :/
[09:22:32] elukey: I'm here to help - let me know if there's anything I can do -
[09:24:22] joal: ahahha
[09:24:28] I am joking of course
[09:24:32] so an1062 is booting
[09:25:20] all good, everything up
[09:26:46] not really good, though :-) there's no sign of why it crashed in logs
[09:27:36] moritzm: it seems a random failure, didn't check the racadm log though
[09:28:05] Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
[09:28:10] ahhaha
[09:28:16] seriously?
[09:28:50] yeah, memory errors
[09:29:03] and also DIMM_B1
[09:29:12] although, no that was in 2017
[09:29:18] it seems that the first log was in 2017 though
[09:29:20] yeah
[09:29:46] maybe the dimm caused this freeze ?
[09:29:51] it would kinda make sense
[09:30:16] in any case, going to open a task to swap it
[09:30:19] yeah, although these should have ECC, so it must be really broken
[09:32:35] 10Analytics-Kanban, 10Operations, 10ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966345 (10elukey)
[09:32:44] there you go
[09:33:12] joal: nice https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=41&fullscreen&orgId=1&from=now-6h&to=now
[09:33:24] so we are back to recovering from an1057
[09:34:17] elukey: right - If we wait for block recovery, I think it'll be too late to migrate to j8, and tomorrow I can't start before 17:00 :(
[09:34:32] pffff
[09:35:41] joal: why too late? We have to just send an email and alert people, the downtime should be relatively small (1h probably)
[09:36:25] elukey: At the current rate, it'll have recovered at about standup time
[09:37:00] elukey: I'm super fine doing it at that time, I just don't want to push you working at night :)
[09:38:03] joal: Andrew will also be there so in case I can handover to him, I am not looking forward to delaying this upgrade but it is also something that we cannot postpone again :D
[09:38:20] very much agreed elukey
[09:38:40] I wonder about moving before blocks recovery :(
[09:39:43] joal: I think it will be fine, but better not to risk it
[09:50:31] PROBLEM - HDFS corrupt blocks on analytics1001 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=39&fullscreen
[09:51:19] this alarm might be too sensitive
[09:51:24] but glad that it works
[09:51:35] 27 blocks corrupted, super fine
[09:51:41] this was right after the 1062 failure
[10:08:05] elukey@analytics1001:~$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
[10:08:08] Connecting to namenode via http://analytics1001.eqiad.wmnet:50070
[10:08:10] The filesystem under path '/' has 0 CORRUPT files
[10:10:38] same thing even with -openforwrite
[10:10:49] so the stats might be lagging a bit
[10:13:17] 10Analytics, 10TCB-Team, 10Two-Column-Edit-Conflict-Merge, 10WMDE-Analytics-Engineering, and 5 others: How often are new editors involved in edit conflicts - https://phabricator.wikimedia.org/T182008#3966426 (10GoranSMilovanovic) #analytics-eventlogging Hey, please check out T182008#3966356. Background:...
[10:53:38] joal: https://community.hortonworks.com/articles/4427/fix-under-replicated-blocks-in-hdfs-manually.html seems interesting
[11:07:29] wow joal https://ofirm.wordpress.com/2014/01/18/exploring-hdfs-block-placement-policy/
[11:07:59] the default block placement was completely not intuitive to me
[11:08:34] I checked /user/elukey/.staging/job_1456242175556_10078 and some blocks are indeed replicated into two racks
[11:09:04] my assumption was that we had three different racks for each block
[11:09:32] elukey: mitigating inter-rack traffic with rack failure
[11:11:08] joal: sure, but a rack failure can also mean the replica count for a block down to 1
[11:11:33] elukey: for a block down to RF-1, any failure can mean data loss
[11:11:56] ?
[11:12:21] well, rack failure is problematic for hadoop because it usually involves multiple machines failing at the same time
[11:12:45] If you have RF-1 for a block, whether rack failure or just single host failure, you lose data
[11:13:46] elukey: looks like I don't make sense, or don't understand :)
[11:15:06] ah sure RF-1 == replication factor?
[11:15:42] but I meant in our case: we have RF-3, but a rack failure can get to a block having only one replica
[11:15:56] it will be put in high position of the replication priority queue
[11:16:02] buuut still
[11:16:10] anyhow, I don't make sense to myself either
[11:16:18] today is Hadoop discovery day
[11:16:21] :D
[11:18:57] elukey: Makes sense!
[11:19:50] elukey: I assume that the mitigation of inter-rack network price vs rack-failure probability has been studied empirically - but who knows !
[11:26:00] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966645 (10fgiunchedi) p:05Triage>03Normal
[11:26:08] 10Analytics-Kanban, 10Operations, 10ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966646 (10fgiunchedi) p:05Triage>03Normal
[11:42:48] !log force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around)
[11:42:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:51:03] (03PS1) 10Joal: Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154
[11:52:15] \me ottomata[m] will like that --^
[11:52:26] hm - sorry - again
[11:52:35] * joal hopes --^
[11:57:18] (03CR) 10jerkins-bot: [V: 04-1] Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154 (owner: 10Joal)
[12:16:04] heloooo
[12:16:37] Hi mforns
[12:51:00] 10Analytics, 10Research: Mount XML dumps on stat1004 - https://phabricator.wikimedia.org/T187178#3966846 (10diego)
[12:51:42] hey dsaez, all good with your rsync?
[12:55:13] * elukey lunch!
[12:55:17] (will read laterz :)
[13:12:11] RECOVERY - HDFS corrupt blocks on analytics1001 is OK: OK: Less than 60.00% above the threshold [2.0] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=39&fullscreen
[14:00:27] gooood --^
[14:02:26] HEYYYYY how goooes it?!
[14:02:34] hello :)
[14:03:06] we didn't start, analytics1057 failed due to a broken disk and analytics1062 (iirc) froze for dimm issues
[14:03:36] Under replicated blocks is still kinda high https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&from=now-2d&to=now&panelId=41&fullscreen
[14:03:40] so we thought to wait
[14:04:25] geeez
[14:04:30] just on its own?
[14:04:36] just coincidence timing?
[14:04:56] elukey: doing hw budget stuff, i see you in there :)
[14:05:04] i guess analytics1001-1003 need replaced too
[14:05:17] Q: we kinda talked about getting a failover for analytics1003 db too, right?
[14:05:36] do you think we could just put a slave replica on analytics1002 (replacement)?
[14:05:40] and still do this with just 2 nodes?
[14:06:05] hmm, i guess we want a place where we could potentially run all the daemons too?
[14:06:08] hive, oozie, etc?
[14:06:38] elukey: yep! rsync working perfectly. Now I need the XML dumps on that machine :D https://phabricator.wikimedia.org/T187178
[14:08:08] ottomata: on its own, separate failures :D
[14:08:24] the under replicated blocks are surely due to an1057 being down
[14:08:51] just answered the zookeeper question in the sheet!
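
A quick sketch of the HDFS health checks used in this log, combining the fsck run above with dfsadmin (run as the hdfs user on the active NameNode):

  # Cluster-wide totals, including the under-replicated block count
  sudo -u hdfs hdfs dfsadmin -report | head -20

  # Filesystem summary: under-replicated blocks, corrupt blocks, default replication factor
  sudo -u hdfs hdfs fsck / | tail -30

  # List any files with corrupt blocks (the check run earlier against analytics1001)
  sudo -u hdfs hdfs fsck / -list-corruptfileblocks
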
[14:09:59] about the an1003 failover - I'd love to have a complete replica of an1003 somewhere else, to use it as hot standby when needed
[14:10:09] but it might be a ganeti host
[14:11:03] we could use ganeti for the services
[14:11:08] not sure i'd want to for the mysql part
[14:11:10] hmmm
[14:11:22] elukey: we could just get replacements for an01 and an02
[14:11:40] and run mysql on those, with an02's mysql as the normal mysql master, and an01 as the standby slave
[14:11:41] ?
[14:12:00] load is usually very low, even on an01
[14:13:21] ottomata: we could yes, but my paranoia level would be high due to an1001's importance
[14:14:20] hmm, i guess we might use a standby to somehow do a failover so we can do maintenance?
[14:14:24] would we do that?
[14:14:29] hive/oozie failover is probably not easy
[14:15:46] hive/oozie failover would be a pain probably for not a big gain, I'd concentrate on two use cases
[14:16:07] 1) mysql down on an1003, Druid complaining
[14:16:36] 2) analytics1003 broken, mysql + all daemons have no place to run
[14:16:57] well, if 1) mysql down, all daemons will complain
[14:17:04] sure yes :D
[14:17:16] I was specifically referring to Druid since it is kinda important for us
[14:17:47] i think that 1) is unlikely to happen without 20
[14:17:49] 2)
[14:17:56] so
[14:18:04] you'd prefer if we had a failover coordinator box then?
[14:19:26] yep that would be great, but mostly to have a mysql replica
[14:23:39] elukey: have we heard from jaime about budgeting for dbstore replacements?
[14:24:16] hey ottomata/elukey what do you guys think would be a good location for the EL whitelist in HDFS?
[14:24:18] I am pinging them now
[14:25:25] hm, mforns maybe we should maintain it in refinery like we do other things like this?
[14:25:28] we have the project list in there, right?
[14:25:33] should we put whitelist in there too?
[14:25:34] not sure.
[14:25:45] aha
[14:25:46] maybe in refinery is less flexible, since it isn't as easy to deploy that change as puppet?
[14:26:55] ottomata, yes, I think ease of deployment of puppet is a plus
[14:27:25] hm, it's still going to be a pain though
[14:27:33] because puppet is going to have to put it in hdfs
[14:27:34] which is not easy
[14:27:44] and ottomata, the coolest would be if we had just one whitelist, no?
[14:27:55] yes
[14:28:00] and puppet could ensure its presence in both locations
[14:28:06] ottomata: yep we need to order dbstore1002's replacements
[14:28:12] but mysql flattens... i guess the mysql purger could flatten the fields
[14:28:24] instead of the whitelist being flattened itself
[14:28:59] ottomata: also please remove the zookeeper nodes, I had a chat with Giuseppe and Faidon, it is probably not worth it to proceed for this budget
[14:29:20] ottomata, the way I structured it, the specific EventLoggingSanitization job reads the flat whitelist and builds a structures whitelist object, that is passed to the generic WhitelistSanitization core class
[14:29:33] *structured
[14:31:18] formatting the whitelist object is the main responsibility of EventLoggingSanitization.scala, we can do it there or else change both the whitelist and the purging script...
[14:33:11] oh elukey really?
[14:33:25] oof mforns sounds nasty
[14:33:32] what if a field name actually is field_name?
[14:34:00] ottomata, not sure I understand
[14:34:12] how are you going to build the structured field name from the flattened field name?
[14:34:24] let's say original is
[14:34:29] event.my_field
[14:34:32] that gets flattened to
[14:34:34] event_my_field
[14:34:42] EventLoggingSanitization.scala can assume that all whitelist fields that start with 'event_' belong to the event StructType
[14:34:42] how do you go backwards?
[14:34:49] HMmmmm
[14:34:53] i dunno.... i guess, sounds gross
[14:35:05] other than this
[14:35:21] is there a technical reason to make eventcapsule specific logic?
[14:35:30] might be nice if it didn't know anything about event capsule stuff
[14:35:56] ottomata, the generic part, WhitelistSanitization.scala, does not know of capsule or EventLogging conventions
[14:36:08] ah ok
[14:36:08] that is the core of the job
[14:36:11] hm
[14:36:13] i see
[14:36:15] ok
[14:36:25] EventLoggingSanitization.scala is mere parameter parsing plus whitelist formatting
[14:36:30] but, still, other than the fact that we already have this whitelist for mysql specific stuff
[14:36:46] is there a reason to not maintain the whitelist with field names as they actually are?
[14:37:18] elukey: , removed.
[14:37:23] no, we could change the whitelist (quite easily) but we would also have to change the mysql purging script
[14:37:33] (not so easy...)
[14:37:43] potentiall
[14:37:44] y
[14:38:38] mforns: i know that will be more work, but it sounds backwards currently
[14:38:54] likely this hive purging stuff will outlast the mysql one
[14:39:01] we have in annual plan to actually get rid of mysql
[14:39:01] hopefully!
[14:39:03] :]
[14:39:32] might be better to fix the whitelist now, and make the specific flattening stuff happen for the mysql purger
[14:39:32] yes, definitely at some point the whitelist format will change, so, no problem in doing it now!
[14:39:42] rather than maintain a weird whitelist
[14:39:47] sure
[14:39:49] and special logic to transform it back
[14:39:54] I'll do that
[14:39:59] ok cool thank youUuU
[14:40:09] :]
[14:50:10] (03CR) 10Fdans: [C: 032] Remove Curaçao as a country [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/409155 (owner: 10Fdans)
[14:50:24] (03CR) 10Fdans: [V: 032 C: 032] Remove Curaçao as a country [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/409155 (owner: 10Fdans)
[15:07:28] (03CR) 10Ottomata: "One comment." (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154 (owner: 10Joal)
[15:07:36] joal: you are magical
[15:07:44] i actually tried to do this last night, and totally failed
[15:07:56] but i was mostly working with the df and row, didn't get to the rdd much
[15:07:57] wow
[15:08:22] was trying to build fancy select statements with reordered fields and then renames
[15:13:41] 10Analytics-Kanban: Small map UI changes - https://phabricator.wikimedia.org/T187205#3967632 (10fdans)
[15:32:36] wow, what a dramatic morning for our hardware!!!
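
A tiny illustration of the "how do you go backwards?" problem from the event.my_field exchange above: once dots are flattened to underscores, two different structured field names can collide, so the inverse mapping is ambiguous.

  echo "event.my_field" | tr '.' '_'   # -> event_my_field
  echo "event_my.field" | tr '.' '_'   # -> event_my_field too: can't tell them apart
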
[15:35:10] elukey: Back from having caught Lino at school
[15:35:23] looks like under-replicated blocks are not yet ready
[15:35:25] yeah
[15:35:39] I'll take the 1/2 hour before standup to make fire then :)
[15:35:51] I'll be on this evening to make it happen :)
[15:35:53] 10Analytics, 10Analytics-EventLogging, 10Performance: Spin out a tiny EventLogging RL module for lightweight logging - https://phabricator.wikimedia.org/T187207#3967721 (10AndyRussG)
[15:37:47] (03PS1) 10Fdans: Release 2.1.8 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/410187
[15:40:32] (03CR) 10Fdans: [C: 032] Release 2.1.8 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/410187 (owner: 10Fdans)
[15:40:51] (03CR) 10Fdans: [V: 032 C: 032] Release 2.1.8 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/410187 (owner: 10Fdans)
[15:41:21] fdans: do you want me to force a puppet run?
[15:41:38] elukey: no need, thank you luca :)
[15:45:45] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3966288 (10Cmjohnson) megacli Enclosure Device ID: 32 Slot Number: 3 Drive's position: DiskGroup: 4, Span: 0, Arm: 0 Enclosure position: 1 Device Id: 3 WWN: 5000c50080a382b5 Sequence Numb...
[15:46:21] (03PS1) 10Fdans: Release 2.1.8 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/410190
[15:46:39] (03CR) 10Fdans: [V: 032 C: 032] Release 2.1.8 [analytics/wikistats2] (release) - 10https://gerrit.wikimedia.org/r/410190 (owner: 10Fdans)
[15:51:00] 10Analytics-Kanban, 10Operations, 10ops-eqiad: Broken disk on analytics1057 - https://phabricator.wikimedia.org/T187162#3967804 (10Cmjohnson) Ticket created with Dell You have successfully submitted request SR960779440.
[15:57:36] guys, is stat1005 dead?
[15:57:57] -bash: fork: Cannot allocate memory
[15:58:53] it was yes
[15:58:57] the oom killer acted
[15:59:22] dsaez@stat1005:~$ -bash: wait_for: No record of process 6285
[15:59:30] can't do anything
[15:59:57] there is a heavy python process running
[16:00:51] yes, 2 :) but still alive
[16:02:04] oops if that's me
[16:02:10] I can kill one :)
[16:02:24] I was about to ping you miriam ! :P
[16:02:37] :P got it
[16:02:53] let's also try to nice them
[16:04:57] elukey: sure, sorry :) tensorflow stuff
[16:08:52] 10Analytics-Kanban, 10Operations, 10ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#3966345 (10Cmjohnson) This will require troubleshooting, moving DIMMs around and waiting to see if the error returns. Do you see anything else in logs besides the idrac log? Dell will requ...
[16:08:54] someday i should write up how to abuse pyspark to run arbitrary python stuff inside the hadoop cluster instead of fighting for the stat100X machines
[16:09:12] ebernhardson: that would be so nice
[16:16:27] SMalyshev: o/ - I tested kafkacat -L -b deployment-kafka-jumbo-1.eqiad.wmflabs:9092 on the host itself and it works fine
[16:16:33] where did you test it?
[16:19:11] SMalyshev: are you signed into a deployment-prep host?
[16:19:15] when you run that?
[16:21:32] smalyshev probably won't start for another 30-60 minutes, but i imagine it would be from a test WDQS instance
[16:22:06] he's been working on a wdqs updater that pulls from kafka instead of reading the recent changes api, so he would be reading kafka from that machine
[16:22:34] which is not in deployment-prep project?
[16:22:43] that would explain why he can't connect
[16:23:34] hmm, it should be in labs at least, beyond that i'm not sure
[16:24:08] s/labs/cloud/
[16:31:45] fdans, elukey: retrooooooo
[16:36:12] oh joal there is a failure
[16:36:37] java.io.FileNotFoundException: /usr/share/GeoIP/GeoIP2-City.mmdb (No such file or directory)
[16:36:52] 19:56:13 *** 1 TEST FAILED ***
[16:46:14] 10Analytics, 10Analytics-Wikistats: Contextualize wikistats metrics - https://phabricator.wikimedia.org/T187212#3968053 (10Nuria)
[16:54:17] nuria: how does this work? https://gerrit.wikimedia.org/r/#/c/403916/8/refinery-core/src/test/java/org/wikimedia/analytics/refinery/core/TestMaxMindGeocode.java it doesn't pass a test maxmind db path.
[17:01:34] ottomata: looks it up on the jvm properties
[17:01:48] ottomata: one sec
[17:02:50] ottomata: https://gerrit.wikimedia.org/r/#/c/403916/8/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/maxmind/MaxmindDatabaseReaderFactory.java
[17:03:20] nuria_: but in a test?
[17:03:30] DEFAULT_DATABASE_CITY_PATH doesn't exist
[17:03:34] (03PS2) 10Joal: Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154
[17:03:51] ottomata: I can help with that
[17:03:55] ottomata: batcave?
[17:04:13] k
[17:05:01] ottomata: mvn passes it on pom
[17:05:06] ottomata: take a look
[17:05:25] nuria_: we're in da cave, he now knows :)
[17:05:30] joal: ok, good
[17:07:29] (03PS3) 10Milimetric: [WIP] Saving in case laptop catches on fire [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408848
[17:09:24] (03CR) 10jerkins-bot: [V: 04-1] Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154 (owner: 10Joal)
[17:12:27] oh joal/nuria, can we merge the geocoding + ISP patch?
[17:13:21] ottomata: good for me
[17:13:36] (03PS8) 10Ottomata: Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237)
[17:14:19] (03PS3) 10Joal: Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154
[17:18:18] going for dinner - will be back at the hour for j8 upgrade
[17:18:19] :D
[17:21:57] (03CR) 10jerkins-bot: [V: 04-1] Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237) (owner: 10Ottomata)
[17:22:09] (03CR) 10jerkins-bot: [V: 04-1] Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154 (owner: 10Joal)
[17:27:02] milimetric: do you have a min to explain to me how to stop report updater?
[17:27:35] elukey: just disable cron / comment out the cron jobs
[17:27:46] on thorium?
[17:27:56] they're on a couple of different machines defined in puppet
[17:28:33] all right!
[17:29:04] (looking it up now, sorry on the phone and eating lunch at the same time)
[17:29:24] https://github.com/wikimedia/puppet/blob/d35b2a05729c1f78e0b3b85ccdf2a25fe3032694/modules/role/manifests/statistics/cruncher.pp
[17:29:26] nono I can check them, didn't know where to look
[17:29:39] 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Services (doing): Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3968338 (10Pchelolo) Vagrant is a bit specific. By default the `node-rdkafka` just clones librdkafka sources into a subfolder a...
[17:29:45] https://github.com/wikimedia/puppet/blob/84159138cf6ff0b6c05928e579edc6422b552545/modules/profile/manifests/reportupdater/jobs/hadoop.pp#L42
[17:29:54] https://github.com/wikimedia/puppet/blob/84159138cf6ff0b6c05928e579edc6422b552545/modules/role/manifests/statistics/private.pp#L9
[17:30:01] I think just stat1005 and stat1006
[17:30:11] nothing on thorium, that's just serving static stuff
[17:32:23] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968342 (10Nuria) mmm.. no data coming in...
[17:36:15] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968352 (10bmansurov) When I visit https://research.wikimedia.org I see that the following URL is being pinged: ```https://piwik.wikimedia.org/piwik.php?acti...
[17:38:43] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968359 (10Nuria) Argh, sorry, @bmansurov this needs to be patched on the header before the closing tag, I missed that in your CR, my apologies.
[17:40:03] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968371 (10Nuria) mmm.. but regardless, it should still work even where it is, it will just miss some visits ...
[17:47:01] perhaps known but i didn't see a related ticket, /mnt/hdfs is disconnected on stat1005
[17:48:04] yeah I am about to fix it ebernhardson
[17:48:14] sadly sometimes it happens after oom killer or high load events
[17:52:24] fixed
[17:54:44] thanks!
[17:55:57] 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Services (doing): Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3968402 (10Ottomata) > Vagrant is on Debian scratch and we don't have librdkafka 0.11.3 packaged for scratch You mean Stretch?...
[17:56:25] ebernhardson: can I ask you a quick thing?
[17:56:40] elukey: sure
[17:57:09] I saw a Discovery Transfer To https://elastic1017.eqiad.wmnet:9243 job in yarn
[17:57:21] elukey: yup, runs every monday for ~40 hours to 2 clusters
[17:57:22] since we are about to shut down the hadoop cluster for say 30 mins (java 8 upgrade)
[17:57:45] elukey: not a big deal if it's canceled, it's shipping weekly article popularity information which doesn't change *that* much week to week
[17:57:58] I am also seeing MLR: training pipeline xgboost: enwiki :D
[17:58:04] I'll alert you when we are ready
[17:58:14] elukey: that's me right now, just testing something. It's also killable
[17:58:38] super.. we are basically waiting for https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&from=now-1h&to=now&panelId=41&fullscreen
[17:58:41] :D
[17:59:09] that's getting pretty close :) I'll just cancel the mlr one now
[18:01:35] we also need to prep
[18:01:42] so don't worry, I'll alert you :)
[18:02:22] 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Services (doing): Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3968426 (10Pchelolo) 05Open>03Resolved Agreed, that's what I've been thinking as well. So, resolving the ticket!
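
For reference, re-attaching the wedged /mnt/hdfs fuse mount usually boils down to something like this sketch (an assumption: the mount is defined in /etc/fstab on the stat hosts):

  sudo umount -l /mnt/hdfs   # lazily unmount the stale fuse mount
  sudo mount /mnt/hdfs       # remount it from fstab
  ls /mnt/hdfs | head        # sanity check that the mount answers again
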
[18:05:28] ottomata: I tried it from another labs host
[18:05:37] not from the same host
[18:06:22] I need it to work on another host, otherwise I'd have to install the whole blazegraph thing on deployment-kafka-jumbo-1
[18:06:58] SMalyshev: sure, which host?
[18:07:06] SMalyshev: it will work from another host in the deployment-prep project
[18:07:22] elukey: wdqs-test on wdqs projects
[18:07:37] yeah, projects are siloed off from each other
[18:07:40] they are separate networks
[18:07:44] yeah
[18:07:49] hmm... so how can I access it?
[18:07:50] you can make a new node in the deployment-prep project
[18:08:05] (sorry, in meeting)
[18:08:11] ottomata:
[18:08:11] elukey@krypton:~$ curl http://localhost:8100/v2/kafka/main-eqiad/consumer
[18:08:14] {"error":false,"message":"consumer list returned","consumers":["change-prop-updateBetaFeaturesUserCounts","change-prop-wikibase-addUsagesForPage","change-prop-on_backlinks_update","change-prop-purge_varnish-resource_change","change-prop-ores_cache_2","change-prop-on_transclusion_update","change-prop-on_wikidata_description_change","change-prop-mobile_rerender-resource_change","change-prop-page_
[18:08:20] images_mobile","change-prop-RecordLintJob","change-prop-summary_definition_rerender-resource_change","change-prop-revision_visibility_change","change-prop-htmlCacheUpdate","change-prop-mw_purge","change-prop-flaggedrevs_CacheUpdate","change-prop-page_images_summary","change-prop-ores_cache_1","change-prop-null_edit","change-prop-page_create","change-prop-page_move","change-prop-page_edit","chan
[18:08:22] hmm with full blazegraph install?
[18:08:26] ge-prop-wikidata_description_on_edit","change-prop-page_delete"]}
[18:08:29] \o/
[18:12:19] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3968458 (10elukey) ``` elukey@krypton:~$ curl http://localhost:8100/v2/kafka/main-eqiad/consumer {"error":false,"message":"consumer list returned",...
[18:19:49] elukey: shall I suggest we stop camus?
[18:21:02] joal: already did, together with report updater :)
[18:21:06] the cluster is drained
[18:21:19] ottomata: hellooooo
[18:21:19] camus-webrequest_canary_test
[18:21:26] we are about to start :P
[18:21:45] joal: batcave?
[18:22:31] elukey: IOMW
[18:28:45] people we are about to start the upgrade
[18:29:49] good luck!
[18:30:12] thanks mforns ;)
[18:32:01] ottomata, elukey: does kafka only use port 9092 or also other ports? maybe I can create a tunnel there... don't really want to create a whole new blazegraph setup in a different project if I can avoid it
[18:39:01] SMalyshev: hey sorry, was in meeting
[18:39:06] hmm, you might be able to get a tunnel going
[18:39:09] it's only port 9092
[18:39:17] you'd need a tunnel for all the brokers though
[18:39:23] orr, hm
[18:39:24] no? not sure.
[18:39:25] anyway
[18:39:30] hmm that won't work I only have one port
[18:39:37] SMalyshev: deployment-prep == beta == testing/staging
[18:39:47] for prod services
[18:39:51] if WDQS is a prod service
[18:39:57] you probably should set it up in deployment-prep anyway
[18:40:10] esp if you want to use a maintained kafka cluster + eventbus events, etc.
[18:40:22] yeah I do want it
[18:40:24] you can set up kafka in your own project if you like, and then emit some fake test data to it
[18:40:29] i can help you with that, it isn't so hard
[18:40:36] Hi miriam - even if niced out, your runs on stat1005 are making the host complain (see ops channel)
[18:40:54] ottomata: well ideally I'd want to have the real thing accessible (production changes) but that's another big topic
[18:40:57] miriam: PUT IT IN HADOOP DOOO IT
[18:40:57] :)
[18:41:09] hey joal! I see
[18:41:18] miriam: Could it be possible for you to tweak them so that they don't use the full number of cores of the machine but leave 1 or two for the system?
[18:41:31] yeah SMalyshev btw, i opened https://github.com/Blizzard/node-rdkafka/issues/336
[18:41:31] joal: sorry about that, sure i'll do that!
[18:41:41] if we get that we can get timestamp based eventstreams
[18:41:45] ottomata: I don't want completely fake data I want to test with data that wikis produce... I tested already with fake data, but I fear that's not enough
[18:41:47] but, there hasn't been any movement
[18:41:58] SMalyshev: you could pull from eventstreams and produce to kafka
[18:42:02] fake data are too clean, I can't create the weird race conditions that happen on real data
[18:42:13] miriam: no big deal, just letting you know :)
[18:42:24] you just use revision-create, right?
[18:43:00] ottomata: hmm... you mean set up a kafka host that pulls eventstreams continuously and stores it in its own kafka? that may work though I am not sure how to do it. But maybe that's what I'll try next
[18:44:06] ottomata: would that work on labs? I mean, if I needed to store like 2 weeks of changes for say wikidata, would labs be able to host it?
[18:44:10] (03PS9) 10Ottomata: Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237)
[18:44:25] SMalyshev: depends on how big your kafka box is
[18:44:28] :)
[18:44:43] there are limits on how big labs boxes can be...
[18:44:47] don't do it to the deployment-prep box
[18:44:52] but let's estimate! :)
[18:45:24] https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?from=1518027173617&to=1518210669667&refresh=5m&panelId=32&fullscreen&orgId=1&var-cluster=analytics-eqiad&var-kafka_brokers=All&var-topic=codfw_mediawiki_revision-create&var-topic=eqiad_mediawiki_revision-create
[18:45:28] e.g. xlarge is 16 GB / 160 GB
[18:45:37] is 16G enough for kafka?
[18:45:40] revision create is about 15Kbps
[18:45:43] let's say you do all revision-create
[18:46:15] looks like 2 weeks of revision-create is about 18GB
[18:46:23] right now it's four topics
[18:46:29] oh?
[18:46:34] not just revision-create?
[18:46:47] mediawiki.revision-create, page-delete, page-undelete and page-properties-change
[18:46:49] ah
[18:46:58] page-delete, etc. are not in eventstreams
[18:47:02] no reason they couldn't be, except no one wanted them yet...
[18:47:03] ottomata: quick check - Do you recall the UTF8 issue we had with java?
[18:47:08] hmm that may be a problem
[18:47:25] joal: thanks a lot! ottomata: yes YES! :)
[18:47:25] ottomata: Is there a way we can double check the same settings apply to j8?
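
One way to double-check the encoding settings under the new JVM, per the question above (a sketch; -XshowSettings is available from Java 7 onwards):

  # Effective JVM system properties; file.encoding should still be UTF-8
  java -XshowSettings:properties -version 2>&1 | grep -i encoding

  # And what the running daemons were actually started with,
  # if -Dfile.encoding is passed explicitly on their command lines
  ps aux | grep -o 'file.encoding=[^ ]*' | sort -u
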
[18:47:43] SMalyshev: if you just need test data, you could also just grab it all from kafka or hadoop and download it and then upload it to a labs gox
[18:47:46] labs box
[18:47:49] and then produce that to kafka manually
[18:48:12] miriam: same request with TF and hadoop: Let's just make sure we don't burn all the machine's cores ;)
[18:48:27] SMalyshev: you are not the first one to want prod eventbus kafka data in labs though...
[18:48:33] instead of eventstreams
[18:48:36] hmmm
[18:48:53] you'll get continuous beta.wikimedia.org updates if you move your stuff to deployment-prep
[18:48:54] but not prod data
[18:49:17] is there a ticket for this already? it's a tricky one, but often asked for
[18:49:22] ottomata: I don't really need the whole prod data, what we already put in public (plus delete/undelete ones) is ok for me, but it's not in kafka, so not seekable
[18:49:28] lemme use the ever useful phab search ...:p
[18:49:48] kinda
[18:50:06] (03CR) 10jerkins-bot: [V: 04-1] Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237) (owner: 10Ottomata)
[18:50:13] halfak: also really wants this (i think)
[18:50:22] https://phabricator.wikimedia.org/T161731 is about it
[18:50:39] ah yes, ok, but for prod data in cloud vps hmm
[18:50:43] kafka
[18:50:58] i'm going to make a ticket, not that we will actually do it anytime soon, but it would be good to track use cases
[18:51:05] so we have a case to make to ops to let us do it
[18:51:07] so generally if I could store eventstream data in some kafka that would be publicly accessible and available for say 14 or 30 days back that'd work for me
[18:51:19] PROBLEM - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is CRITICAL: CRITICAL - druid_realtime_banner_activity is 0 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[18:51:19] cool
[18:51:52] joal: I'll be suuper careful, promised. TensorFlow is a monster!
[18:52:36] ottomata: so going back to the labs thing, if we have 15 kbps just for revision-create... let's say other topics are less chatty, so maybe 20kbps overall.
[18:53:28] that's 1.7G per day. for 14 days that's 24 G. May be ok for a labs host
[18:54:27] ottomata: I understand we have both eqiad and codfw events on eventstream, right?
[18:56:08] ottomata: then another concern. if I have this thing sitting on a labs host and feeding from EventStream, what happens if that host reboots? Is there any way to resume from the point it used last?
[18:56:59] SMalyshev: if you use an EventSources AKA SSE consumer client
[18:57:00] yes
[18:57:06] it uses the Last-Event-ID header
[18:58:18] https://wikitech.wikimedia.org/wiki/EventStreams#KafkaSSE and https://github.com/wikimedia/KafkaSSE#kafkasse
[18:58:41] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Set up a Cloud VPS Kafka Cluster with replicated eventbus production data - https://phabricator.wikimedia.org/T187225#3968572 (10Ottomata)
[18:58:55] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Set up a Cloud VPS Kafka Cluster with replicated eventbus production data - https://phabricator.wikimedia.org/T187225#3968584 (10Ottomata) I'm not suggesting that we do this soon, but should keep this in mind for future work. I think more and more folks wi...
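
A back-of-envelope check of the sizing figures above (reading the quoted "kbps" as kilobytes per second, which is what makes the numbers line up):

  echo "scale=1; 15000*86400*14/10^9" | bc   # ~18 GB: two weeks of revision-create
  echo "scale=1; 20000*86400/10^9"    | bc   # ~1.7 GB/day at 20 kB/s across all topics
  echo "scale=1; 20000*86400*14/10^9" | bc   # ~24 GB for 14 days of retention
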
[18:59:05] ottomata: ok, sounds cool. The question is: is there a way to persist it somehow? Should I just persist it manually (e.g. write a java app that does it)?
[18:59:29] ottomata: also, do KafkaStreams have support for KafkaSSE? That would probably make it easier to work with
[18:59:41] no
[18:59:58] EventStreams is meant more to work with client side type services
[19:00:11] you can consume in a browser etc.
[19:00:22] SMalyshev: i don't know if any SSE/EventSource libs have persistence
[19:00:28] since they are usually meant to work client side
[19:00:29] but
[19:00:37] all you have to do is save the Last-Event-ID header
[19:00:43] and then provide it if you need to reconnect
[19:01:13] or, more precisely
[19:02:15] while consume message {
[19:02:16]   offsetFile.write(message.id)
[19:02:16]   ...
[19:02:16] }
[19:02:16] # on start up of process
[19:02:16] if offsetFile.exists()
[19:02:17]   lastEventIdHeader = offsetFile.read()
[19:02:17]   ...
[19:02:18] something like that ^
[19:03:57] ottomata: wait, so kafkaSSE is a nodejs module?
[19:04:23] so that means if I want to produce to a local kafka setup I'll have to make it in nodejs, right?
[19:04:47] joal: fixed - it's now running on one core only, TF has a function for that. For the moment it's ok :) slow, but ok. Sorry abt that.
[19:05:20] miriam: Thanks a lot, I'd be super happy if you use many cores but leave 1 or 2 for the system :)
[19:05:44] miriam: Like that the host doesn't complain and it still runs faster
[19:05:59] miriam: But I don't know TF well, so can't help on config really
[19:09:14] joal: no prob, it's just a parameter in the configuration of the TF Session! I can try to add more. thanks a lot!
[19:10:27] oh I got it... SSE library is for producing things, not consuming...
[19:10:47] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968668 (10Nuria) Ah, no, it just skipped the archive run: INFO [2018-02-13 10:15:31] Skipped website id 13, already done 23 hours 18 min ago, Time elapsed...
[19:19:19] SMalyshev: yeah i just linked there for the Last-Event-ID documentation
[19:22:26] ok nuria_, joal dunno what i'm doing wrong here https://gerrit.wikimedia.org/r/#/c/407508/
[19:22:34] still not picking up the test maxmind dbs
[19:22:39] i've added the files
[19:22:40] and also
[19:22:45] https://gerrit.wikimedia.org/r/#/c/407508/9/refinery-job/pom.xml
[19:22:50] i changed to **/Test*.scala
[19:22:52] instead of *.java
[19:22:54] but same result
[19:23:39] ottomata: I'll have a look later or tomorrow depending on j8
[19:24:23] ok
[19:24:30] hmm, maybe i need it in a different plugin conf?
[19:24:37] is the scala test different than maven surefire?
[19:25:33] yes ottomata - scala tests are run by scalatest-maven-plugin
[19:25:50] ottomata: let me fix piwik stuff and will look at that in 2 mins
[19:25:52] ok, so i need to find a way to set system properties in that plugin
[19:25:53] looking
[19:25:56] thanks joal that's gotta be it
[19:25:59] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[19:26:10] ottomata: ah yes
[19:26:19] ottomata: scala tests do not run via junit
[19:27:02] joal: 1005 is still super slow
[19:27:08] joal: right?
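
A runnable take on the resume-from-offset pseudocode above, sketched with curl against EventStreams (stream name illustrative; SSE frames carry id: and data: fields, and the saved id is replayed via the Last-Event-ID header):

  #!/bin/bash
  STREAM=https://stream.wikimedia.org/v2/stream/revision-create
  OFFSET_FILE=./last-event-id

  # On startup, resume from the previously saved offset if we have one
  ARGS=()
  [ -s "$OFFSET_FILE" ] && ARGS=(-H "Last-Event-ID: $(cat "$OFFSET_FILE")")

  curl -sN "${ARGS[@]}" "$STREAM" | while IFS= read -r line; do
    case "$line" in
      id:*)   printf '%s' "${line#id: }" > "$OFFSET_FILE" ;;  # persist resume point
      data:*) echo "${line#data: }" ;;                        # hand the event downstream
    esac
  done
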
[19:27:27] nuria_: I have not tried
[19:28:02] nuria_: should be ok - TF uses less than half the cores and is niced - other processes should be ok
[19:29:54] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968815 (10Nuria) Ok, i see 6 visits from yesterday, take a look: http://piwiki.wikimedia.org @bmansurov there is a file on stat1005 on my homedir with your...
[19:29:59] (03PS10) 10Ottomata: Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237)
[19:30:23] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968818 (10Nuria) @bmansurov stats are delayed up to 24 hours, thus data for the current day is always partial
[19:31:53] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968824 (10Nuria) file is /home/nuria/piwik-research
[19:32:08] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue
[19:37:22] (03CR) 10Ottomata: [C: 032] Refactor geo-coding function and add ISP [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/403916 (https://phabricator.wikimedia.org/T167907) (owner: 10Joal)
[19:38:12] I DID IT! thanks y'all
[19:38:14] tests pass
[19:38:39] fdans, milimetric: was there an agreement on the table data alignment on wikistats?
[19:39:23] ottomata: ok
[19:39:42] joal: let's merge the geo code once the upgrade is done and has baked a bit
[19:40:01] nuria_: i just merged geo code, thought joal said it was ok
[19:40:04] haven't deployed anything, just merged
[19:40:07] ottomata: yaya
[19:40:15] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968871 (10bmansurov) @Nuria, https://piwik.wikimedia.org won't let me in with the username/password you provided. I can login with my Gerrit credentials thou...
[19:40:15] nuria_: merge can happen anytime (it'll facilitate ottomata's life) - let's just wait before deploy
[19:40:20] ottomata: it is fine, we tested it with data a while back
[19:40:37] ottomata: in order to deploy it we need to stop/start refine jobs though
[19:42:47] nuria_: right aligned for numbers, that's for sure, but it looks kind of weird with the dates. We can leave it like that for now
[19:42:51] 10Analytics-Kanban, 10Research-landing-page, 10Patch-For-Review: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3968880 (10Nuria) Sorry, there are two levels of access: 1) ldap (usual credentials) 2) piwik
[19:43:32] fdans: then we need to right align this table: https://stats.wikimedia.org/v2/#/all-projects/reading/pageviews-by-country
[19:43:37] cc milimetric
[19:44:42] nuria_: why, the numbers there are right-aligned
[19:46:17] (03PS1) 10Ottomata: Add TransformFunctions for JsonRefine job [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410240 (https://phabricator.wikimedia.org/T185237)
[19:46:41] milimetric: ok, wait, numbers right aligned and labels (for countries) left aligned?
[19:47:08] https://usercontent.irccloud-cdn.com/file/CcqyeUm5/Screen%20Shot%202018-02-08%20at%201.26.26%20PM.png
[19:47:10] ideally, yeah, that's great, because it's easier to see what they go with
[19:47:13] (03CR) 10Ottomata: "Hm, somehow have pushed a new change with an identical Change-Id for this jsonrefine branch. Old one is at https://gerrit.wikimedia.org/r" [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410240 (https://phabricator.wikimedia.org/T185237) (owner: 10Ottomata)
[19:47:14] but for dates it's harder
[19:47:20] so we can leave them like that
[19:47:23] milimetric: ok, then nothing to do
[19:47:27] yea, it's cool
[19:47:34] cc fdans
[19:47:48] (03PS1) 10Ottomata: Add dataframe conversion to new schema function [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410241
[19:48:06] yep, +1 on right alignment of numbers
[19:48:23] I got a branch with nuria's alternative, but I'm with milimetric on this
[19:48:25] (03CR) 10Ottomata: "Changed branches, created new change with identical Change-Id. Old one is here: https://gerrit.wikimedia.org/r/#/c/410154/" [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410241 (owner: 10Ottomata)
[19:54:21] 10Analytics, 10Operations, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968921 (10Nuria) As of me checking today referral groups are mostly unchanged from the graphs i pasted above. That is, safari sessions ap...
[19:57:03] mforns: fyi, i'm not sure yet but, with joseph's patch today, json refine may become much more abstract
[19:57:16] ottomata, aha
[19:57:20] it might be a general search, transform, alter hive tables, insert script
[19:57:26] that could work with any dataset readable by spark
[19:57:28] so
[19:57:36] ottomata: sorry to interrupt :)
[19:57:37] https://gerrit.wikimedia.org/r/#/c/410244/1/modules/profile/manifests/java/analytics.pp
[19:57:45] mind to review --^ ?
[19:57:53] ah snap druid
[19:57:54] we could reuse that?
[19:57:55] you might be able to use the same job entry point (currently named JsonRefine) with your custom purging functions, to insert into new hive tables...
[19:57:56] ufffffffff
[19:57:57] yeah
[19:58:16] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3968945 (10Smalyshev)
[19:58:17] ottomata, ok, could look into that...
[19:58:43] you'd point it at the private tables that 'jsonrefine' makes, but extract the table name or database name differently, so that it will target new 'public' tables, and then give it your purging transform function
[19:58:47] elukey: looking
[19:58:59] actually, there's not much repetition, I'm reusing SparkJsonToHive
[19:59:04] right, cool
[19:59:09] yeah, so that is going to change so that instead of a base path
[19:59:19] it takes a DataFrame
[19:59:23] if this works, i'm trying it now...
[19:59:35] aha
[20:00:00] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3968945 (10Pchelolo) @Smalyshev But we already have topics and schemas for these events, see https://github.com/wikimedia/mediawiki-event-schemas/bl...
[20:00:41] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3968975 (10Smalyshev) @Pchelolo Not in https://wikitech.wikimedia.org/wiki/EventStreams as far as I can see?
[20:00:57] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3968978 (10Ottomata) @Pchelolo he just wants us to configure them for EventStreams, so you can get at them at e.g. stream.wikimedia.org/v2/stream/pa...
[20:01:29] ottomata: what about druid??
[20:01:32] I thought about it now
[20:01:55] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services: Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3968980 (10Pchelolo) Oh, ok, I've misread EventStreams for EventBus :)
[20:03:12] * SMalyshev re-reads https://wikitech.wikimedia.org/wiki/Event* each time one of those is mentioned just in case :)
[20:04:33] druid uses /usr/bin/java so as long as we don't change it we are good
[20:04:37] I think
[20:04:43] ottomata: thoughts --^
[20:04:44] ?
[20:05:03] haha SMalyshev yeah maybe I should make an Event_Event_Event page :p
[20:05:30] 10Analytics, 10Operations, 10Research, 10Traffic, and 6 others: Referrer policy for browsers which only support the old spec - https://phabricator.wikimedia.org/T180921#3968983 (10Tgr) If nothing seems more broken than before that's good enough for me :) The main goal was to prevent Edge and Safari from se...
[20:05:34] elukey: i think that will be fine
[20:05:34] heh :)
[20:05:39] especially if you don't restart druid :)
[20:11:44] 10Analytics-Kanban, 10Discovery-Analysis, 10Reading-analysis: Pageviews/Stats on dataviz-literacy.wmflabs.org - https://phabricator.wikimedia.org/T187104#3969006 (10mpopov) @Nuria: my blog post is ready to be published but I'm holding off until I add the piwik stuff :) if there's a chance you can help with t...
[20:12:55] 10Analytics, 10Cloud-VPS, 10EventBus, 10Services (watching): Add page delete/undelete and prop changes topics to EventStreams - https://phabricator.wikimedia.org/T187241#3969018 (10Pchelolo)
[20:25:26] 10Analytics-Kanban: English Wikivoyage traffic spike possible bot - https://phabricator.wikimedia.org/T187244#3969063 (10Milimetric)
[20:25:32] HaeB: hello! You are the first user using a full java 8 cluster :)
[20:25:46] elukey: yes! i was just about to ask
[20:26:03] we are completing the migration as we speak, testing the regular jobs
[20:26:12] (there's two other jobs on yarn though also)
[20:28:46] great ... BTW, any potential issues to look out for in particular? or conversely, what are the benefits we can expect from the upgrade?
[20:29:12] you can now write spark code in java with lambdas? :) maybe that's only a benefit to some people ... :)
[20:29:14] last trend for ML geeks: https://arxiv.org/abs/1702.08835
[20:30:57] HaeB: nothing special should be noticeable - The change is super important for us to be able to upgrade some other stack pieces (druid) to java8, with strong improvements that time
[20:31:30] also I am interested in seeing GC metrics and general performance of the new JVM with real prod traffic
[20:31:33] :)
[20:31:53] joal: it's like someone said 'wow, deep networks are expensive to train. Can i encourage the same pain with tree models?' :P
[20:32:07] :D
[20:41:05] milimetric: do you know the answer to petr's q here?
[20:41:05] https://gerrit.wikimedia.org/r/#/c/410251/
[20:41:07] last comment
[20:41:40] a-team: people we are running on java 8 :)
[20:41:58] \o/ kudos guys
[20:42:08] awesome
[20:42:15] checking, andrew
[20:42:38] ottomata: mind to come in the cave a sc?
[20:42:40] sec
[20:42:43] a-team: only expected alarms are on banner realtime job
[20:42:56] NIIIICE
[20:42:57] can
[20:43:04] a-team: this one is not super happy, but we'll investigate tomorrow
[20:43:08] joal: any way to skip an oozie job that somehow lost data? Since you restarted everything i checked my oozie workflows and apparently one has been stuck since feb 1st waiting on a _SUCCESS tag from camus that never came
[20:43:14] https://hue.wikimedia.org/oozie/list_oozie_coordinator/0123506-170829140538136-oozie-oozi-C/#
[20:44:17] Pchelolo/ottomata: gerrit seems to be bugging out, not letting me sign in, but the answer is, the Analytics mysql replicas on dbstore1002 are not altered from production data in any way
[20:44:31] only the labs replicas are redacted, but is that what you were asking about?
[20:44:34] ebernhardson: not that I know ! ebernhardson - I'd probably either manually put a _SUCCESS file, but that's super not clean - Or kill/restart the coord (not nice either)
[20:45:05] milimetric: basically if we're ok exposing page-properties-change stream
[20:45:12] to the public?
[20:45:21] and regarding gerrit - try opening it in chrome - that helped in my case
[20:45:27] yeah, I'm in chrome :/
[20:46:29] milimetric: the labs ones
[20:46:32] that's what we are asking about
[20:46:34] since they are public
[20:46:45] * elukey off!
[20:46:55] milimetric: there was an email thread about your bug on some list this morning
[20:46:58] ottomata: yeah, but so far we're not doing the same kind of sanitizing on event streams, stuff that's redacted from the labs replicas is not redacted in our streams
[20:46:58] there is some cookie you need to delete
[20:47:07] yes milimetric
[20:47:09] but, in general
[20:47:11] we are just wondering
[20:47:18] are page property changes redacted in labs replicas?
[20:47:22] if not
[20:47:25] then we can just expose it without thinking about it
[20:47:33] in general, if we want to match what's happening there, it's definitely complicated, we have to take a close look at how the DBAs are sanitizing the data and write corresponding stream sanitizers
[20:47:42] I can check
[20:47:52] k, we don't want to try to do any matching now, but just use it as a baseline
[20:47:53] if it is there
[20:48:00] we can assume someone has thought about it and it is ok to expose
[20:48:12] ok, I'll check for page props in particular then
[20:49:46] thank you milimetric
[20:54:11] joal: looks like restart it is :)
[20:56:48] RECOVERY - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is OK: OK - druid_realtime_banner_activity is 2406 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[20:57:23] (03CR) 10Ottomata: Add dataframe conversion to new schema function (031 comment) [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410241 (owner: 10Ottomata)
[20:59:58] fyi, joal since i moved it to a remote branch, the original changes are no longer the ones, i'm going to abandon them
[21:00:00] Pchelolo / ottomata: ok, so I looked over all the redaction code that I could find, and found no mention of page_props being redacted in any way.
I spot checked the table and it doesn't look like it's redacted, the same data was there on labs and production for the records I checked. The same number of records show up in labs and production, so individual rows aren't filtered. I think that's safe, but the way we redact is
[21:00:00] not sane at all so I'm not 100%
[21:00:17] (03Abandoned) 10Ottomata: Add TransformFunctions for JsonRefine job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/407508 (https://phabricator.wikimedia.org/T185237) (owner: 10Ottomata)
[21:00:29] Pchelolo: to be 100% sure, you'd have to ask the DBAs
[21:00:32] (03Abandoned) 10Ottomata: Add dataframe conversion to new schema function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/410154 (owner: 10Joal)
[21:00:51] greaat, thanks milimetric
[21:00:58] good enough for me
[21:02:11] joal: not sure this thing is working for me... if you have a min, bc?
[21:04:03] ottomata: OMW !
[23:20:37] (03CR) 10Nuria: "Much i do not understand, but liked the use of trait for helper and ratifies a bit the refactoring of geo code as it ends up being easier " (032 comments) [analytics/refinery/source] (jsonrefine) - 10https://gerrit.wikimedia.org/r/410240 (https://phabricator.wikimedia.org/T185237) (owner: 10Ottomata)
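
For reference, the page_props spot check described above boils down to comparing the redacted labs view with a production replica; the hostnames here are illustrative:

  QUERY='SELECT COUNT(*) FROM page_props'
  mysql -h enwiki.labsdb enwiki_p -e "$QUERY"           # labs (redacted) replica
  mysql -h dbstore1002.eqiad.wmnet enwiki -e "$QUERY"   # production analytics replica
  # Matching counts, plus eyeballing a few rows, suggests the table
  # is not filtered or altered on the labs side.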