[03:49:00] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:00:10] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:39:07] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1033 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[04:43:36] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1033 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[05:35:33] <elukey>	 good morning 1033 and 1034, thanks for the OOMs
[05:35:42] <elukey>	 really a good start of the day
[07:35:31] <elukey>	 tried to build https://grafana-admin.wikimedia.org/dashboard/db/analytics-hadoop but there is something weird with the JMX metrics
[07:35:55] <elukey>	 I obtain always the second graph for log fatals, warnings, threads blocked ,etc...
[07:37:05] <elukey>	 plus GCCount always at 4?
[07:37:09] <elukey>	 mmm
[07:53:13] <aharoni>	 Hallo.
[07:53:22] <aharoni>	 Me again, asking about webrequest queries.
[07:54:14] <aharoni>	 I tried to run this: https://gist.github.com/amire80/b00ab67615dc68a1d03c816f8b1d7a10
[07:55:06] <aharoni>	 And then `SELECT prev, sum(n) FROM cross_wiki_navigation WHERE day = 21 GROUP BY prev ORDER BY prev LIMIT 1000;`
[07:55:26] <aharoni>	 And I get results for languages such as cy, bs and hr
[07:56:10] <aharoni>	 which is ultra-weird, because these are supposed to be referrals from domains like cy.wikivoyage.org, bs.wikivoyage.org and hr.wikivoyage.org
[07:56:16] <aharoni>	 and these domains don't exist.
[07:57:00] <aharoni>	 more precisely, they are redirects to incubator,
[07:57:22] <aharoni>	 but the big initial query is supposed to be only for targets that are Wikivoyage, not Incubator.
[07:57:27] <aharoni>	 How can this happen?
[07:57:37] <aharoni>	 (The schema is amire80.)
[08:00:13] <aharoni>	 joal ^
[08:12:12] <elukey>	 team: going afk for ~1hr, will be back online soon! :)
[08:21:20] <joal>	 aharoni: There must be rows that match your pattern
[08:32:13] <joal>	 Hey addshore, I'm sorry, I owe you 2 CRs
[08:32:22] <addshore>	 haha! :D
[08:32:41] <addshore>	 I would have poked you but yesterday ended up getting exactly 1 item down my 7 item list... :/
[08:32:54] <joal>	 addshore: I kinda know the feeling :)
[08:34:02] <joal>	 addshore: I'll ask questions in here instead of in gerrit, will be a bit faster :)
[08:34:08] <addshore>	 [=
[08:34:29] <joal>	 addshore: About graphite default namespace, daily.wikidata.articleplaceholder or no .wikidata?
[08:35:00] <addshore>	 ahh yes, so I am fine with it either way
[08:35:07] <addshore>	 I'll go and add the wikidata! :)
[08:35:20] <joal>	 addshore: so am I, more a question of data consistency :)
[08:37:40] <joal>	 addshore: Have you tested your last scala version?
[08:37:51] <joal>	 addshore: I can guess no, seems to be a bug ;)
[08:38:18] <addshore>	 you know, because of how abruptly I stopped doing it I actually can't remember
[08:38:26] <joal>	 huhuhu :)
[08:38:30] <joal>	 no problem
[08:38:42] <grrrit-wm>	 (PS12) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[08:38:51] <joal>	 addshore: the change you pushed between version 10 to 11 seems to be a bug
[08:39:45] <addshore>	 hmmm https://gerrit.wikimedia.org/r/#/c/295896/10..11/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/WikidataArticlePlaceholderMetrics.scala ?
[08:39:56] <joal>	 yes
[08:40:03] <addshore>	 that will result in en_wikipedia right?
[08:40:19] <joal>	 addshore: You have 3 %s and only 2 values in format
[08:40:29] <addshore>	 oh bah
[08:40:39] <addshore>	 so I do
[08:40:47] <joal>	 :)
[08:40:48] <addshore>	 right, I need to replace the . in project with an _!
[08:41:30] <joal>	 addshore: only one %s at the end of metric string, and project.replace('.', '_')
[08:41:58] <grrrit-wm>	 (PS13) Addshore: Add WikidataArticlePlaceholderMetrics [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500)
[08:42:17] <joal>	 addshore: And I push you to use spark-shell to test :)
[08:42:23] <joal>	 addshore: Have you done that before?
[08:42:36] <addshore>	 nope, you talked of spark-shell but I didn't end up trying it!
[08:42:45] <joal>	 you have aminute to try that now?
[08:42:49] <addshore>	 sure! :D
[08:42:55] <joal>	 ok :)
[08:43:27] <joal>	 so in stat1002: spark-shell --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
[08:44:11] <joal>	 This command starts a scala spark shell, using yarn (our hadoop cluster) as global container, and with executors having the settings defined
[08:44:28] <joal>	 Should take a few seconds to give you a scala prompt
[08:44:40] <addshore>	 yup, I have one!
[08:44:46] <joal>	 great
[08:45:22] <joal>	 In there, as the logs tell you, you already have sc (SparkContext) and sqlContext (HiveContext) defined, no need to instanciate them anew
[08:46:11] <joal>	 My way to test code like yours is: copy-paste the Params class
[08:46:55] <joal>	 Then create a test instance of params:
[08:48:23] <joal>	 val params = Params(year=2016, month=6, day=30, graphiteHost="graphite-in.eqiad.wmnet", namespace="daily.test.wikidata.articleplaceholder")
[08:48:38] <addshore>	 yup
[08:49:28] <joal>	 Then paste commands one by one and check (with modifs to print the metric instead of sending to graphite)
[08:50:08] <addshore>	 ahh, so it has sqlContext and not hiveContext?
[08:50:09] <joal>	 You'll see some errors: need to rename hiveContext to sqlContext
[08:50:17] <joal>	 And to import DateTime
[08:51:41] <joal>	 addshore: getting is the habit of testing small portions of code like that, I usually use sqlContext as variable name even if it's a hive context: )
[08:52:18] <joal>	 addshore: Another interesting thing with those shells is that you can easily use them for one shots queries / analysis
[08:52:20] <addshore>	 okay!
[08:52:45] <joal>	 addshore: If you prefer, the thing is even available in python
[08:53:00] <joal>	 addshore: But I prefer scala, so I tend to present it in scala first ;)
[08:55:19] <addshore>	 cool, so it all looks good as far as I can tell!
[08:55:24] <joal>	 addshore: Something else to know: the error logs about lost executors are not actually errors
[08:55:50] <joal>	 addshore: Spark uses dynamic allocation: It instanciates executors as it needs it, then release them to share resources on the cluster
[08:56:08] <joal>	 addshore: The logs are a false alarm about executors being release for others to use
[08:56:24] <joal>	 addshore: This is a known bug of spark, corrected in next version :)
[08:56:55] <joal>	 addshore: I tested the lines while writing to you how to do it, the code seems correct (and the result expected)
[08:57:19] <joal>	 addshore: Now, one last thing is to test the real thing as we did before, building the jar and so on
[08:58:00] <joal>	 addshore: The spark-shell test is usefull for building code and making sure stuff works as expected, but a global integration test is still better: )
[08:59:21] <joal>	 addshore: Sorry for the long discussion
[09:01:04] <addshore>	 lets see if I can remember :)
[09:01:37] <addshore>	 packaging the latest stuff on stat1002 now
[09:03:05] <joal>	 k
[09:07:56] <addshore>	 running
[09:08:16] <addshore>	 nope, think I got the command wrong :D
[09:12:12] <joal>	 addshore: :D
[09:12:34] <addshore>	 right, I think it ran *checks graphite*
[09:13:46] <addshore>	 hmm
[09:15:01] <joal>	 addshore: seems to have worked :)
[09:15:10] <joal>	 addshore: no?
[09:15:50] <addshore>	 yeh, i'm just wondering why lv_wikipedia is the only thing reported, but that may just be correct
[09:16:03] <addshore>	 --year 2016     --month 06     --day 28 guess ill check with hive first
[09:16:17] <joal>	 addshore: My guess is that the string matching you are doing depends on projects
[09:21:36] <addshore>	 yup, hive says the data is only from lv for the hour I checked
[09:23:50] <addshore>	 in a 5 min meeting, then I'll be back!
[09:24:24] * elukey back!
[09:28:56] <elukey>	 a-team I am going to reboot eventlog* hosts
[09:29:11] <elukey>	 security updates
[09:29:19] <joal>	 elukey: have you seen yesterday's meessage from ori about EL erros from old ZMQ still alive instances?
[09:29:25] * elukey things about the meme "Brace yourselves, winter is coming"
[09:29:44] <joal>	 elukey: might be wise to triple check what's running and so on
[09:29:51] <joal>	 elukey: Right
[09:29:55] <grrrit-wm>	 (PS2) Addshore: DRAFT Ooziefy Wikidata ArticlePlaceholder Spark job [analytics/refinery] - https://gerrit.wikimedia.org/r/296407
[09:30:59] <elukey>	 joal: I've read the backlog but didn't find any reference to ZMQ
[09:31:02] <elukey>	 re-reading
[09:31:19] <elukey>	 anyhow, eventlog2001 is safe to reboot
[09:31:26] <elukey>	 1001 may need to wait for ori
[09:31:28] <joal>	 elukey: The instances having issues are the ones perf team use, which are still using ZMQ IIRC
[09:31:33] <joal>	 k elukey
[09:31:37] <joal>	 you know best )
[09:31:45] <addshore>	 joal: I have just updated the oozie bit with the new namespace too! :)
[09:31:58] <joal>	 addshore: will have a lok
[09:51:16] <elukey>	 eventlog2001 rebooted, nothing on it :)
[10:11:48] <moritzm>	 elukey: damn, I accidentally rebooted eventlog[12]001, sorry for that :-/
[10:12:21] <moritzm>	 wrong command from my bash history, meh...
[10:13:09] <moritzm>	 wanted to run "uname -a" instead :-/
[10:14:43] <moritzm>	 both back up, sorry for any inconvenience caused by that :-/
[10:24:31] <grrrit-wm>	 (CR) Joal: [C: 2] "LGTM" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/295896 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[10:24:41] <addshore>	 wahahha :P
[10:25:04] <joal>	 ?
[10:26:16] <addshore>	 replace wahahha with woooo! ;)
[10:26:21] <joal>	 huhu
[10:30:46] <elukey>	 moritzm: sorry just seen the msg
[10:30:55] <elukey>	 checking, should be fine!
[10:30:55] <icinga-wm>	 PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [30.0]
[10:31:13] <elukey>	 need only a restart probably, or something similar
[10:31:57] <moritzm>	 oK!
[10:32:16] <elukey>	 all good! https://grafana.wikimedia.org/dashboard/db/eventlogging
[10:32:21] <elukey>	 the alarm will be back inline soon
[10:32:42] <elukey>	 it didn't make any sense to restart it since it was rebooted :/ silly me
[10:34:25] * elukey looks at https://grafana.wikimedia.org/dashboard/db/performance-metrics
[10:34:38] <elukey>	 wow way cooler than what I remember
[10:36:59] <elukey>	 joal: have you a good ssh command to use for jvisualvm by any chance?
[10:37:32] <elukey>	 I am trying ssh iron.wikimedia.org -L 9981:analytics1034.eqiad.wmnet:9981 -N -v and then a jmx conn to localhost:9981
[10:37:33] <joal>	 elukey: never used jvisualvm
[10:37:37] <elukey>	 ah okok
[10:37:41] <joal>	 sorry :(
[10:38:06] <joal>	 elukey: I usually jump to analytics using stat100x
[10:38:16] <elukey>	 yeah tried that but it doesn't work
[10:38:20] <elukey>	 weird
[10:38:22] <icinga-wm>	 RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0]
[10:38:30] <elukey>	 moritzm: --^ all good
[10:38:39] <elukey>	 joal: I wanted to check the datanode's GC graphs
[10:38:46] <elukey>	 and Heap sizes over time
[10:38:52] <joal>	 elukey: yes, understood that
[10:40:51] <elukey>	 the jmx metrics look weird
[10:40:52] <elukey>	 :(
[10:48:09] <joal>	 elukey: need any help?
[10:48:17] <joal>	 also elukey, any news from Alex?
[10:48:20] <moritzm>	 ok, good :-)
[10:49:17] <elukey>	 joal: nono I am reading some things on GC, refreshing memory.. I am going to write an email soon, didn't get the time
[10:49:27] <joal>	 k elukey :)
[10:49:57] <joal>	 elukey: not having things to work with you today makes me feel unusual ;)
[10:50:52] <elukey>	 joal: don't say that! It is friday and all sorts of things can explode! :D
[10:51:10] <joal>	 :D
[10:56:36] <wikibugs>	 Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690#2421018 (elukey) Open>Resolved
[10:57:01] <wikibugs>	 Analytics, Analytics-Cluster: cdh::hadoop::directory (and other hdfs puppet command?) should quickly check if namenode is active before executing - https://phabricator.wikimedia.org/T130832#2421021 (elukey) a:elukey>None
[11:18:43] * elukey commutes to the office!
[11:24:45] <grrrit-wm>	 (CR) Joal: [C: -1] "Not bad for a first oozie try :)" (19 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/296407 (owner: Addshore)
[11:25:27] <joal>	 addshore: Just realised that it would be good to update your scala code to name the namespace parameter: graphite-namespace, explicit for the win :)
[11:30:41] <joal>	 milimetric: Hi !
[11:34:37] <wikibugs>	 Analytics-Tech-community-metrics: Check whether mailing list activity per person on korma is in sync with current "mlstats_mailing_lists.conf" - https://phabricator.wikimedia.org/T132907#2421092 (Qgil)
[11:53:15] <milimetric>	 hey joal
[11:55:08] <joal>	 milimetric: Good morning o/
[11:56:39] <joal>	 milimetric: can we try settle globally on splitwise? I need to send my invoice for the month, including the Berlin per-diems
[11:56:39] <milimetric>	 mornin'
[11:57:01] <milimetric>	 oh yes, I was waiting for everyone to be back, just in case someone hadn't done their expenses
[11:57:13] <milimetric>	 but I'm good, I added all mine
[11:57:25] <joal>	 milimetric: so did I
[11:57:49] <joal>	 milimetric: even if not perfect, I just prefer not to make decision alone as to how many days I should ask for :)
[11:58:46] <milimetric>	 hm, let's see so worst case someone forgot to put something in, in which case we can give that person a few more per-diems to make up for it.  I think it's ok, I'll send a message to internal and go to splitwise and do it
[11:59:40] <milimetric>	 hm, some people didn't add their expenses shared with everyone and it looks like people added their 5-day transport passes
[12:00:22] <joal>	 milimetric: wow, it'll get tricky
[12:00:33] <milimetric>	 no, not at all
[12:00:48] <milimetric>	 if we just share everything with everyone, it will average out
[12:01:15] <joal>	 k, but for instance I think only nuria put her travel pass
[12:01:26] <joal>	 mwarf, nevermind :)
[12:01:43] <joal>	 it's a few euros, it doesn't matter :)
[12:02:19] <milimetric>	 right, but everyone should put their travel pass.  So we have to wait until everyone does
[12:02:34] <milimetric>	 either way, the per-diems will pay for everything in the end
[12:02:58] <milimetric>	 like, nobody is paying anything out of pocket.  Let's hangout, I hate text :)
[12:03:10] <milimetric>	 batcave work for you?
[12:03:14] <joal>	 will try
[12:03:27] <milimetric>	 btw, you said something about the calendar yesterday?  You access the cave from the calendar?
[12:03:37] <joal>	 yessi
[12:03:42] <milimetric>	 ew, why? :)
[12:03:43] <milimetric>	 https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave
[12:03:46] <milimetric>	 just bookmark that
[12:26:54] <nuria_>	 hello elukey and joal and milimetric
[12:27:01] <milimetric>	 hi nuria_
[12:27:19] <milimetric>	 omg, it's crazy early for you no?
[12:27:36] <nuria_>	 milimetric: jaja jetlag-> makes you more productive
[12:27:55] <elukey>	 o/
[12:28:06] <nuria_>	 milimetric: did you share with team anything of interest you learned at wikimania?
[12:28:16] <milimetric>	 not yet, I was waiting for everyone to be back
[12:28:30] <milimetric>	 I have a long list of potential follow-ups
[12:28:43] <milimetric>	 which we could consider in our prioritization.  And Madhu might as well
[12:29:05] <joal>	 Hey, nuria_ is back ! o/
[12:30:25] <nuria_>	 sounds good, let's do that , lemme schedule a staff meeting cause we do not have one next week
[12:31:23] <milimetric>	 joal: I changed your bills to euros
[12:31:24] <nuria_>	 milimetric: actually let's use the retro for that , I will cancel retrospective next week and we can talk on that slot about wikimania , sounds Good?
[12:31:44] <milimetric>	 also, lol, how did everyone manage to get a different price for the 5-day pass?  Someone in Berlin's making some money :)
[12:31:55] <joal>	 milimetric: Wow thanks ! I think I was indeed inconsistent
[12:32:02] <joal>	 :D
[12:32:20] <milimetric>	 nuria_: sure, sounds good
[12:34:35] <milimetric>	 yay show and tell :)
[12:34:41] <joal>	 milimetric: corrected mine to adjust to yours (lost the ticket, did it from memory)
[12:34:46] <nuria_>	 elukey: what about varnish4 issues ? we are still seeing alarms
[12:34:58] * milimetric never had show and tell growing up in Romania and he's been making up for it with a vengeance ever since
[12:35:22] <milimetric>	 joal: i looked up what they charge at the airport
[12:35:27] <milimetric>	 (that's where I bought mine)
[12:35:31] <joal>	 right
[12:35:47] <joal>	 Sounds ok with what I have in mind (30ish)
[12:40:37] * elukey is failing miserably to use jvisualvm or jconsole
[12:44:45] * elukey just needed to proxy different ports
[12:45:04] * elukey goes in the corner to cry
[12:48:04] <nuria_>	 elukey: ay ay
[12:51:59] <elukey>	 still unable to make jvisualvm to work
[12:52:05] <elukey>	 but jconsole shows things
[12:56:44] <elukey>	 so I am checking analytics1034 HDFS datanode's JVM
[12:56:57] <elukey>	 the avg heap size used is ~300MB
[12:57:06] <elukey>	 but the max is 932MB
[12:57:16] <elukey>	 and we have -Xmx1000
[12:57:38] <elukey>	 94 live threads, peak to 211
[12:57:51] <joal>	 elukey: giving it some space is not that expensive, particularly if it does a reasonnably correct job at regularly GC
[12:58:32] <elukey>	 yeah I agree, but I'd love to have all the metrics somewhere from now on to watch how JVMs are behaving
[12:59:02] <elukey>	 2GB for the datanode seems good
[12:59:36] <elukey>	 GC is default one, PS MarkSweep for Old gen
[13:00:10] <elukey>	 not sure about the yarn ones but we might think to test G1
[13:00:22] <joal>	 elukey: I don't really think it makes a huge difference for us, as long as it happens regularly and cleans after peaks
[13:01:31] <elukey>	 well if we improve throughput and GC pauses it won't be bad.. we are not latency bound but at the same time we might want to think about efficiency :)
[13:03:47] <nuria_>	 elukey, joal: did we move forward in the testing of cassandra with the new configuration?
[13:05:48] <elukey>	 kindof
[13:06:09] <joal>	 nuria_: there have been some thinking and coding, but no real move
[13:06:28] <elukey>	 we have been working on making bulk loading from hadoop working
[13:06:39] <elukey>	 but we haven't managed yet
[13:07:06] <elukey>	 plus we are thinking about migrating to cassandra 2.2.6 now rather doing it afterwards..
[13:07:08] <joal>	 nuria_: Here is the debrief: Bulk loading works, but is tricky, due to the version we use being in the middle of cassandra moving from Thrift to NativeCQL
[13:07:10] <nuria_>	 joal, elukey : ok, let's talk about this on standup so as to prioritize work cause i think there are three things going on now: 1) scaling of pageview Api 2) varnishkafka issues (i think we are just waiting for traffic team , but let me know otherwise) 3) new issues with rebooting hosts
[13:07:46] <elukey>	 sure, post-standup looks good
[13:08:04] <nuria_>	 joal, elukey : ok, let's go over whether there any blocker so as to proceed with that work
[13:08:07] <elukey>	 on the varnishkafka side I managed to get down to zero errors on my test hosts
[13:08:30] <nuria_>	 elukey: by changing buffer sizes and such as we discussed?
[13:08:35] <joal>	 nuria_: And while working on that, we realised that taking advantage of reloading the full cluster on new cassandra version (services team is migrating) should prevent us to migrate later on
[13:08:40] <elukey>	 removing Websockets upgrade request and increasing the timeout to 1800
[13:08:47] <joal>	 More detaills post-standup
[13:09:01] <elukey>	 joal: busy week :P
[13:11:07] <nuria_>	 elukey, joal: ok
[13:12:21] <joal>	 elukey: however we still see a lot of misc inconsistency from hadoop
[13:12:38] <elukey>	 joal: ?
[13:12:50] <joal>	 Well oozie sends error almost every hour
[13:13:16] <joal>	 nsorry, warning, not error, but error once in a while
[13:14:30] <joal>	 elukey: if traffic agrees the solution you offer is to last, then we should adpat our validation metric
[13:16:23] <elukey>	 mmm I was talking about test hosts, didn't set up anything yet :)
[13:16:45] <elukey>	 or maybe I am missing the point joal, sorry, Friday ENOCOFFEE etc.. be patient :)
[13:17:56] <joal>	 Hoooo, I think it's me needing coffe then ;)
[13:24:08] <addshore>	 joal: I'll take a look at your comments this weekend or this evening! :)
[13:24:48] <joal>	 addshore: No problem, we can discuss that next week
[13:32:45] <addshore>	 yup!
[13:34:42] <elukey>	 joal: I had a chat with Alex, he gave me a good suggestion, I am not sure why I haven't thought about it before.. We can pool/de-pool hosts for the LVS IP related to AQS via etcd
[13:35:03] <elukey>	 so if we keep the old nodes up and running (and up to speed)
[13:35:18] <elukey>	 we will be able to flip the cluster entirely in a matter of seconds
[13:35:29] <elukey>	 from palladium
[13:35:40] <elukey>	 without any DNS change etc...
[13:36:31] <elukey>	 (this in case we find some weird issue with the new cluster that we didn't see while testing)
[13:36:51] <elukey>	 also, testing 2.2.6 seems to be a good idea :)
[13:42:01] <joal>	 elukey: Great news :)
[13:42:32] <joal>	 elukey: it's also a good idea, the pool/depool
[13:42:44] <joal>	 elukey: we are still far from it, but it's good !
[13:43:02] <elukey>	 it is simply awesome, we will be able to rollback in case of fire almost instantly..
[13:43:29] <elukey>	 this of course does not mean that we'll put untested stuff in prod (note for whoever is reading :)
[13:43:59] <joal>	 definitely not, that's far too dangerous to already try in in test, so no prod
[13:45:01] <joal>	 elukey: Let's wait to have nuria_ on track before taking actions, but we should plan on installing cassandra-2.2.6 early next week
[13:46:50] * elukey looking forward to it
[13:47:46] <joal>	 elukey: you fake it so well ;)
[13:48:10] <elukey>	 no really it will be fun :)
[13:48:29] <joal>	 elukey: hm, given what urandom told us, not sure fun is the correct word ;)
[13:48:42] <elukey>	 we'll move from the worst performing restbase service to the most up to date and cool one
[13:49:06] <joal>	 By the way, might be good to discuss a bit with urandom today, to now what actions need to be taken to install wokring instances of the new beast
[13:49:13] <joal>	 elukey: --^
[13:49:17] <joal>	 elukey: true !
[13:49:33] <elukey>	 yeah, afaik the 2.2.6 deb should have the patches
[13:49:35] <joal>	 elukey: And normally bulk loading should work OUT OF THE BOX !
[13:49:45] <elukey>	 ah good point! I have a question
[13:50:03] <elukey>	 is bulk loading going to require access to CQL?
[13:50:16] <elukey>	 instead of thrift
[13:50:29] <joal>	 elukey: it will
[13:50:47] <joal>	 elukey: or at least, I think so, I should double check the code
[13:50:57] <joal>	 elukey: i WILL double check the code, now
[13:51:06] <elukey>	 and we have srange => "(@resolve((${cassandra_hosts_ferm})) @resolve((${aqs_hosts_ferm})) ${analytics_networks})", for CQL
[13:51:09] <elukey>	 GOOOOOOD
[13:51:17] <elukey>	 we probably need to put only hadoop in there
[13:51:35] <elukey>	 mmm or not
[13:51:45] <elukey>	 mmmmmm (ottomata's return is close)
[13:52:09] <joal>	 huhuhuu :)
[14:07:42] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421318 (Dvorapa)
[14:08:47] <wikibugs>	 Analytics-Cluster, Patch-For-Review: Java OOM errors kill Hadoop HDFS daemons on analytics* - https://phabricator.wikimedia.org/T139071#2421332 (elukey) p:Triage>Normal a:elukey
[14:09:03] <wikibugs>	 Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Java OOM errors kill Hadoop HDFS daemons on analytics* - https://phabricator.wikimedia.org/T139071#2418521 (elukey)
[14:12:00] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421351 (tom29739)
[14:15:59] <milimetric>	 joal: ok, what shall we pair on
[14:16:09] <milimetric>	 your thing or mine
[14:17:03] <wikibugs>	 Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2421374 (elukey) Ran an experiment on cp300[89] and the following query removes all the occurrences of VSL timeout:  ``` varnishlog...
[14:17:16] <joal>	 milimetric: currently hooked on admin stuff, will be ready for some scala in half an hour or so
[14:18:03] <milimetric>	 ok, sweet, I'll try and see what I can figure until then
[14:23:47] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421318 (jcrespo) Query limit is also enforced at server time, do do not worry about this having any kind of impact on our infrastructure. It would be interesting to fix the error/mistake/race condition/bug over...
[15:13:44] <wikibugs>	 Quarry: Wrong status of queries in Recent Queries list - https://phabricator.wikimedia.org/T137517#2421580 (Dvorapa)
[15:14:37] <wikibugs>	 Quarry: Wrong status of queries in Recent Queries list - https://phabricator.wikimedia.org/T137517#2370631 (Dvorapa)
[15:16:55] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421602 (Dvorapa) But I wait more than 2 hours now and this query is completed, but the query above #6992 is now running more than 1 hour.  The bug you mena is not the same as this, but you can find it in T137517
[15:18:35] <joal>	 milimetric: Yay, managed to finish the thing !
[15:18:49] <joal>	 milimetric: Can I scala with you?
[15:21:11] <milimetric>	 YES! OH GOD YES :)
[15:21:14] <milimetric>	 to the batcave!
[15:21:22] <joal>	 OMW !
[16:12:53] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421716 (Dvorapa) Just one query completed in 2 hours: {F4221592}
[16:24:02] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421318 (yuvipanda) I just restarted the runner, hopefully that'll make things better.
[16:26:18] <wikibugs>	 Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2421739 (Nuria)
[16:30:14] <wikibugs>	 Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2421755 (Nuria)
[16:31:27] <wikibugs>	 Analytics-Kanban: Cleanup terabytes of logs on hdfs - https://phabricator.wikimedia.org/T139178#2421739 (Nuria) Configuration setting such logs get deleted after say, a month? we have now logs that are 2 years old.
[16:33:36] <leila>	 hi mforns. I'm going through the thread with Marc Miquel on Analytics. What do you mean by "This would not be retroactive though, we would have to wait a couple  months to collect significant data. In any case, I'm not sure if this  would be possible with an NDA?"
[16:33:37] <leila>	 ?
[16:34:21] <mforns>	 leila, I meant if we use EventLogging to collect time on page we would not have historical data, only from now on
[16:34:50] <leila>	 yes. I agree. what is your point about the NDA, mforns?
[16:35:00] <leila>	 did you mean "without"?
[16:35:35] <mforns>	 and also, that I'm not sure if a person from outside WMF (with NDA) would be able, or would it be easy for them in terms of collaboration with lots of teams, and touching multiple repositories to get that done.
[16:36:25] <leila>	 I got you, mforns. Yes, it would be possible, but it will be time-consuming and we normally don't do that, unless the project is justified to be valuable for the direction we are moving towards.
[16:36:53] <leila>	 I'll comment on the thread to clarify the NDA component, because I think Marc expected that we set up a quick NDA and give him access and he figures out the rest, mforns.
[16:37:00] <mforns>	 no, I meant, I was not sure if it would be practically feasible for a person from outside WMF to modify MediaWiki code + create EventLogging schema + all the discussion and work associated
[16:37:29] <leila>	 ooo, yes. that part won't work, mforns. Someone will need to create the schema in practice.
[16:37:33] <mforns>	 leila, yes exactly, you described it better :]
[16:37:57] <leila>	 I'll add my comments as a complement to yours. Thanks.
[16:38:02] <mforns>	 sorry, my messages are a bit lagged
[16:38:08] <mforns>	 leila, thanks!
[16:38:22] <leila>	 np. thank YOU! :)
[16:39:39] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421780 (Dvorapa) Looks like working, thank you
[16:49:37] <wikibugs>	 Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162#2421815 (yuvipanda) I ran:  ``` MariaDB [quarry]> UPDATE query join query_revision on query.latest_rev_id = query_revision.id  join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or stat...
[17:00:44] <nuria_>	 leila: I was about to send a response to that thread
[17:00:52] <nuria_>	 leila: just did cc mforns
[17:09:23] <mforns>	 milimetric, can you hear me?
[17:12:08] <leila>	 nuria_: oh. just saw your response. you mean you meant to send another response?
[17:12:34] <nuria_>	 leila: no, that's it. I just wanted to re-iterate that the "data does not exist"
[17:12:44] <leila>	 yup. makes sense, nuria_.
[17:12:56] <nuria_>	 leila: not that "it is hard to get"
[17:13:03] <nuria_>	 leila: we just do not have it
[17:13:48] <leila>	 I mean, he can do approximations, nuria_, but it will be a lot of work, and it will not be exact.
[17:14:17] <nuria_>	 Lila: I think it is not possible to even do approximations
[17:14:22] <nuria_>	 leila: sorry
[17:14:26] <leila>	 so it won't be worth it in his case, nuria_. but generally, I wanted him to know that we don't give away access very easily, and there is a process we try to adhere to.
[17:14:28] <nuria_>	 not Lila , juas!
[17:14:51] <leila>	 nuria_: approximations on mobile are easier.
[17:14:51] <nuria_>	 leila: ya, worth it too cause we can point to taht archived thread for future reference
[17:15:06] <leila>	 people open fewer new tabs on mobile, and even fewer in apps.
[17:15:07] <nuria_>	 leila: how so?
[17:15:25] <leila>	 so you won't have the problem of 40 tabs being opened at once and only one of them being read.
[17:15:26] <nuria_>	 leila: but you have IP NAT-ing, shared ips
[17:15:41] <nuria_>	 leila: thus it is harder to pinpoint individual usage
[17:15:49] <nuria_>	 leila: for mobile web requests
[17:16:17] <leila>	 yes, nuria_, though that doesn't seem to be a very big problem for the work we're doing. Using the user_agent and IP even on mobile seem to be a good enough approximation for the work Ellery is doing now.
[17:16:35] <leila>	 yeah, in general the cleanest would be app, nuria_.
[17:54:38] <milimetric>	 mforns: verizon said it could take 24 hours :(
[17:54:46] <mforns>	 milimetric, no problem
[17:54:56] <milimetric>	 I'm on my phone hotspot but I don't think I have enough data in my plan to do a hangout
[17:54:58] <mforns>	 are you in your phone?
[17:55:06] <mforns>	 np
[17:55:41] <milimetric>	 sorry, it kind of sucks, if it comes back soon I'll ping you otherwise we can chat next week
[17:55:56] <mforns>	 milimetric, sure! I will write on an email what I'm doing right now
[17:56:09] <milimetric>	 we can of course chat on here if you like, up to you
[17:56:21] <mforns>	 sure!
[17:56:24] <milimetric>	 (I'm gonna check my data plan actually, see how much I have left)
[17:56:50] <mforns>	 milimetric, I think that the new namespace is actually stored consistently in the log_params prefix!
[17:57:11] <mforns>	 I think we can parse it from there
[17:57:33] <mforns>	 looked for counterexamples across time, but didn't find any
[17:58:01] <mforns>	 the only problem would be knowing the namespace names for other languages
[17:58:02] <milimetric>	 params?!!!
[17:58:17] <milimetric>	 oh, you mean as part of the title
[17:58:26] <mforns>	 milimetric, yes
[17:58:35] <mforns>	 the prefix
[17:59:05] <milimetric>	 yeah, we could try to map that, but it's localized starting in 2007 and not localized before that, etc.  And I believe even the mapping from string to int changes over time
[17:59:17] <mforns>	 aha
[17:59:27] <milimetric>	 but we could check those assumptions
[17:59:38] <mforns>	 the localization I think we can fix
[17:59:42] <milimetric>	 take a look at dewiki which should have gotten any fancy new things like localization early
[18:00:10] <milimetric>	 but then, yeah, if the mapping changes over time, we'd have to rebuild the git history of the preferences files (if those are even checked into git (which I think they are))
[18:00:18] <milimetric>	 #EnglishError - nested parens
[18:00:53] <mforns>	 aha
[18:01:50] <mforns>	 milimetric, but maybe we can go now with simplewiki and enwiki and then figure out how to get the ns map for all other wikis
[18:01:51] <mforns>	 no?
[18:02:22] <milimetric>	 I'd want to check to make sure it's possible before spending time down that path though
[18:02:55] <milimetric>	 (one moment, I'm having trouble with my mobile plan now)
[18:03:23] <mforns>	 yes makes sense
[18:04:58] <wikibugs>	 Quarry: Add CORS or JSONP support to /query/:id/result/latest/0/json endpoint - https://phabricator.wikimedia.org/T139197#2422268 (MusikAnimal)
[18:05:07] <milimetric>	 oh snap! I have tons of data and only 4 days on my cycle.  Ok, let's hangout mforns :)
[18:05:23] <mforns>	 milimetric, ok :]
[18:11:50] <milimetric>	 mforns: https://github.com/wikimedia/operations-mediawiki-config
[18:25:22] <wikibugs>	 Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2422318 (Sadads) @Milimetric & @Legoktm did we ever figure out what the blocker was on the Ev...
[18:34:10] <grrrit-wm>	 (PS1) Yuvipanda: Support CORS for the latest result redirect [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296950
[18:35:05] <grrrit-wm>	 (PS2) Yuvipanda: Support CORS for the latest result redirect [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296950 (https://phabricator.wikimedia.org/T139197)
[18:36:10] <grrrit-wm>	 (PS3) Yuvipanda: Support CORS for the latest result redirect [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296950 (https://phabricator.wikimedia.org/T139197)
[18:37:09] <wikibugs>	 Quarry, Patch-For-Review: Add CORS or JSONP support to /query/:id/result/latest/0/json endpoint - https://phabricator.wikimedia.org/T139197#2422354 (yuvipanda) @MusikAnimal can you try now?
[18:46:19] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] Support CORS for the latest result redirect [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296950 (https://phabricator.wikimedia.org/T139197) (owner: Yuvipanda)
[18:46:54] <wikibugs>	 Quarry, Patch-For-Review: Add CORS or JSONP support to /query/:id/result/latest/0/json endpoint - https://phabricator.wikimedia.org/T139197#2422392 (yuvipanda) Still doesn't work because there's a https -> http redirect in there for some reason.
[18:53:59] <nuria_>	 elukey: yt?
[18:56:07] <grrrit-wm>	 (PS1) Yuvipanda: Slightly stronger user authentication check [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296952 (https://phabricator.wikimedia.org/T134699)
[18:56:09] <grrrit-wm>	 (PS1) Yuvipanda: Make /latest/ redirect use https explicitly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296953 (https://phabricator.wikimedia.org/T139197)
[18:56:18] <grrrit-wm>	 (PS2) Yuvipanda: Slightly stronger user authentication check [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296952 (https://phabricator.wikimedia.org/T134699)
[18:56:36] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] Slightly stronger user authentication check [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296952 (https://phabricator.wikimedia.org/T134699) (owner: Yuvipanda)
[18:56:55] <grrrit-wm>	 (PS2) Yuvipanda: Make /latest/ redirect use https explicitly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296953 (https://phabricator.wikimedia.org/T139197)
[18:57:06] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] Make /latest/ redirect use https explicitly [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296953 (https://phabricator.wikimedia.org/T139197) (owner: Yuvipanda)
[19:01:50] <grrrit-wm>	 (PS1) Yuvipanda: Fix typo [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296955
[19:02:02] <grrrit-wm>	 (PS2) Yuvipanda: Fix typo [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296955
[19:02:07] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Fix typo [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296955 (owner: Yuvipanda)
[19:02:26] <grrrit-wm>	 (PS3) Yuvipanda: Fix typo [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296955 (https://phabricator.wikimedia.org/T139197)
[19:02:35] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] Fix typo [analytics/quarry/web] - https://gerrit.wikimedia.org/r/296955 (https://phabricator.wikimedia.org/T139197) (owner: Yuvipanda)
[19:05:23] <wikibugs>	 Quarry, Patch-For-Review: Add CORS or JSONP support to /query/:id/result/latest/0/json endpoint - https://phabricator.wikimedia.org/T139197#2422464 (MusikAnimal) Working on my end! Many thanks Yuvi! :)
[20:44:25] <wikibugs>	 Analytics-Kanban: Page History: write scala for page history reconstruction algorithm - https://phabricator.wikimedia.org/T138853#2422646 (Milimetric) I was stuck on this but with Marcel's help we created a new table in Hive:      select * from milimetric.namespace_mapping where hostname = 'ja.wikipedia.org'...