[00:46:29] 10Analytics-Kanban, 10Discovery-Analysis, 10MobileApp, 10Wikipedia-Android-App-Backlog: Bug behavior of QTree[Long] for quantileBounds - https://phabricator.wikimedia.org/T184768#3916105 (10Nuria) Here is a version of the script you can execute in the commandline: https://gist.github.com/nuria/157bae27e67... [01:03:59] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3916113 (10Tbayer) >>! In T185350#3913845, @Nuria wrote: ... > it is likely that for media downloaded in chunks the field doesn't reflect file s... [07:29:23] 10Analytics-EventLogging, 10Analytics-Kanban, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3916264 (10elukey) So from https://github.com/dpkp/kafka-python/blob/master/kafka/errors.py it seems that `Error sending GroupCoordinatorReq... [08:57:03] 10Analytics-EventLogging, 10Analytics-Kanban, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3916465 (10elukey) I may not have followed the right trail during the last time, I tried to retrace again an error today. Started from `even... [09:09:46] very interesting, it seems (from my ignorant point of view) that the eventlogging processors are dying due to the librdkafka queue getting full [09:11:04] so maybe the consumer does not have time to commit the msg read, the consumer group rebalances and the next processor responsible for the same set of messages will have to grab the same messages again [09:25:10] Hi elukey [09:25:26] elukey: I'd like a bit more explanation to be sure I understand your reasoning [09:33:44] joal: morninggggg [09:35:06] I noticed the other day with Marcel that the el processors are sometimes pushing duplicate events to their output topics [09:35:20] (valid-mixed and every schema topic IIUC) [09:35:49] they use kafka python to consume and kafka confluent to procude [09:35:51] *produce [09:36:18] now what I am seeing in the logs is that at some point, a processor dies due to [09:36:21] kafka_producer.produce(message_topic, message_value, message_key) [09:36:24] BufferError: Local: Queue full [09:36:49] then upstart creates a new one, but at this point the consumer group gets rebalanced [09:37:24] (or maybe it rebalances two times, once when the processor dies and again when it comes back) [09:38:00] the rebalance is not an issue (I think), the duplicates are probably generated by the "BufferError: Local: Queue full" [09:38:26] since offsets might not have been committed [09:38:52] now I am trying to figure out what the processors are doing to end up in that error state [09:38:58] does it make sense? [09:39:32] elukey: makes more sense - I was not understanding the link with librdkafka :) [09:40:15] ah yes sorry, I should probably have written more words :) [09:40:28] elukey: If I follow you, the duplicated data is the data that was in the full queue (sent once from the producer that ended in error, and a second time by the new producer) [09:41:24] yep exactly [09:41:29] makes sense :) [09:41:50] https://github.com/edenhill/librdkafka/issues/210#issuecomment-134885396 shows some details about why this might happen (from the librdkafka pov) [09:43:34] elukey: one thing I'm still missing is: you said we're using kafka confluent to produce - does that mean kafka confluent uses librdkafka?
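[editor's note: a minimal sketch of the failure mode elukey describes above, assuming confluent-kafka-python; the broker list is illustrative. produce() only enqueues into librdkafka's local buffer, so if brokers stop acking and nothing drains the queue, it eventually fills and produce() raises the BufferError seen in the processor logs.]

```lang=python
from confluent_kafka import Producer

# broker list is hypothetical
kafka_producer = Producer({'bootstrap.servers': 'kafka1012:9092'})

def forward(message_topic, message_value, message_key):
    # fire-and-forget: this only appends to librdkafka's local queue
    # (queue.buffering.max.messages defaults to 100000); if delivery
    # stalls and the queue fills up, this line raises
    # "BufferError: Local: Queue full" and the processor dies
    kafka_producer.produce(message_topic, message_value, message_key)
```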
[09:46:04] joal: I think that both kafka-python and kafka-confluent-python use it (but not 100% sure) [09:46:25] k [09:46:28] https://github.com/confluentinc/confluent-kafka-python for sure [09:52:18] in https://github.com/confluentinc/confluent-kafka-python/blob/0.9.2.x/confluent_kafka/kafkatest/verifiable_producer.py#L112-L116 there is some interesting error-case handling [09:52:53] but this is VerifiableProducer, that might be another thing [09:53:32] anyhow, it is probably an issue with the processor not handling the BufferError failure scenario [10:04:29] 10Analytics-EventLogging, 10Analytics-Kanban, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3916636 (10elukey) @Ottomata do you think that something like the following would be good for the processor's code? https://github.com/conf... [10:05:19] all right I wrote a proposal to --^, let's wait for the Kafka Python master Andrew to see if it makes sense or not :) [10:06:46] awesome elukey :) [10:07:45] joal: http://lkml.iu.edu/hypermail/linux/kernel/1801.2/04628.html [10:26:59] elukey: Linus seems back in the game :) [10:50:36] I created https://gerrit.wikimedia.org/r/#/c/405687/1/eventlogging/handlers.py but not sure if it is good or not [11:36:04] going out for lunch + errand, ttl! [11:36:07] * elukey afk! [13:11:24] Taking a b [13:11:25] reak [14:14:05] (03PS1) 10Fdans: Enable top pageviews by country [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405708 (https://phabricator.wikimedia.org/T175422) [14:17:43] (03PS2) 10Fdans: Enable top pageviews by country [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405708 (https://phabricator.wikimedia.org/T175422) [14:43:37] * elukey just discovered that the Burrow api is awesome [14:43:38] https://github.com/linkedin/Burrow/wiki/HTTP-Endpoint [14:43:40] 10Analytics-Kanban, 10Operations, 10Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3917427 (10Ottomata) Yeah, this is a very common request, and in the past we've told people we can't do it. Their only recourse then is t... [14:43:49] I mean I can get the list of consumers?? [14:43:53] \o/ [14:47:11] yeah, elukey we should give it some more love sometime i think [14:47:13] maybe update it [14:47:39] milimetric: any thoughts on https://phabricator.wikimedia.org/T185350 - in particular, do you recall what the issues were behind https://wikitech.wikimedia.org/w/index.php?title=Analytics/Archive/Data/Pagecounts-raw&diff=next&oldid=817571 ("wasn't very accurate")? [14:48:09] ottomata: o/ - I can work on https://phabricator.wikimedia.org/T180442 [14:48:35] adding it to a debian package and testing it (since it currently supports only the v2 api, v3 coming) [14:48:40] ya elukey [14:48:40] https://github.com/jirwin/burrow_exporter [14:48:41] :) [14:48:46] from [14:48:46] https://github.com/linkedin/Burrow/wiki/Associated-Projects [14:48:53] which also has links to some cool UIs [14:49:08] like [14:49:08] https://github.com/GeneralMills/BurrowUI [14:49:43] it would be awesome for the job queues [14:49:48] yeah [14:49:49] agree [14:55:06] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3917458 (10Ottomata) IINnteresting! Could be worth a try for sure.
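[editor's note: a sketch of the shape of the fix being proposed for eventlogging/handlers.py, loosely modeled on the VerifiableProducer error handling linked above — an assumption about the eventual patch, not its actual content. On BufferError, poll() serves delivery callbacks so librdkafka can free queue slots before the enqueue is retried.]

```lang=python
def produce_with_retry(kafka_producer, topic, value, key=None):
    try:
        kafka_producer.produce(topic, value, key)
    except BufferError:
        # local queue is full: wait up to 0.5s for outstanding
        # deliveries to complete, freeing room in the queue...
        kafka_producer.poll(0.5)
        # ...then retry the enqueue once (a production version
        # might loop with a bounded number of attempts)
        kafka_producer.produce(topic, value, key)
```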
[14:56:13] ottomata: I wasn't sure in --^ if it was better .poll or .flush, it seems weird that the backlog gets filled [14:57:00] elukey: this is in processor, right? [14:57:06] yep [14:57:18] maybe we need more processor processes? [14:57:21] and partitions? [14:57:31] hmm, this is probably the one going to the eventlogging-valid-mixed topic [14:57:36] the produce request [14:57:43] which would be strange if local buffer is full for that [14:57:45] hmm [14:57:53] it always says the problem is with broker 18 though? [14:58:02] i wonder if there is a problem with that one that is causing produce requests to stall [14:58:08] which causes the local buffer to fill up [14:58:24] elukey: we could also consider upgrading confluent--kafka-pythong [14:58:26] pythong* [14:58:27] gah [14:58:31] python* [14:58:33] :) [14:58:37] :) [14:58:47] it's been a while [14:59:16] I think it happens with other brokers, but I'd need to triple check.. (I wondered the same this morning) [14:59:44] 18 is the one that died a while ago, and is now kafka1023, right? [14:59:45] ottomata: one thing that I didn't get from the confluent kafka python docs is if a poll is needed when producing in async [14:59:51] ottomata: exactly [15:00:46] elukey: (just read docs), i guess flush will wait til all messages in the buffer have been acked [15:00:53] which might block things quite a bit [15:01:06] would be fine i think, as long as the processor doesn't then start lagging [15:01:11] it might also be enough to use poll instead [15:01:20] and just make sure that *some* messages were delivered and removed from the queue [15:01:25] if some are removed, then more can be added [15:01:33] or, we could also look into increasing the buffer size? [15:02:46] max.in.flight.requests.per.connection [15:02:47] ? [15:02:59] or [15:03:00] I was wondering the same, maybe flush() is a bit aggressive, maybe poll(0.5) would be better? We could also try to increase the buffer size, but I'd like to have some workaround in place as well in case the limit is hit [15:03:00] queue.buffering.max.messages [15:03:09] I think it is --^ [15:03:09] yeah [15:03:23] as in, if the queue is full, we start blocking and losing throughput, rather than dropping messages [15:03:25] agree [15:03:44] hm, gotta be careful though elukey [15:03:47] i think what you have is good [15:03:52] but in our current state, we just duplicate a lot [15:04:06] because the BufferError exception causes the process to die, and then the respawned one starts from the previous commit [15:04:19] but, if we keep running, we will also continue to commit offsets consumed [15:04:41] so, i think what you have is good, as it should block the process, and messages from eventlogging-client-side won't be consumed in the meantime [15:05:19] the default queue msg size is 1048576 [15:05:22] hm [15:07:14] elukey: i'm not sure how it is hitting that message size, unless a broker is taking a really long time to ack [15:07:19] what if after kafka_producer.produce a kafka_producer.poll(0.001) is needed? [15:07:26] or with async is not needed? [15:07:37] elukey: we'd only poll at all if we wanted to get some kind of delivery report status [15:07:54] or if we wanted to block for queue reasons [15:08:14] elukey: i don't think we are doing async... [15:08:15] are we? [15:08:43] oh [15:08:45] default is True [15:08:45] cool [15:08:46] we are :)
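[editor's note: the poll-versus-flush trade-off being weighed above, as confluent-kafka-python exposes it — a sketch, not eventlogging's actual code:]

```lang=python
# flush(): block until *every* queued message is delivered or errors
# out; safest, but can stall the processor for a long time if a
# broker hangs
kafka_producer.flush()

# poll(timeout): serve delivery callbacks for up to `timeout` seconds
# and return the number of events processed; the queue shrinks as soon
# as *some* messages are acked, so throughput degrades but the
# processor survives
kafka_producer.poll(0.5)

# raising queue.buffering.max.messages (librdkafka default: 100000)
# only postpones the problem if a broker genuinely stalls, e.g.:
# Producer({'queue.buffering.max.messages': 500000, ...})  # hypothetical
```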
[15:09:02] elukey: yeah, so it is not needed [15:09:12] the way this kafka async producer in eventlogging currently works [15:09:15] is fire and forget about ack [15:10:00] if we wanted to do this really ideally, we'd somehow tie the produce ACK with the consumer offset commit [15:10:16] but that gets a little complicated [15:10:28] yeah [15:10:46] i think trying some short poll after a BufferError like this is a good idea [15:10:50] let's try that before flush [15:11:14] poll 0.5 is probably fine, as ideally poll will return before 0.5 is up, right? [15:11:59] in theory yes, in return with the number of events delivered or None if the timeout occurred [15:12:02] (iiuc) [15:12:15] s/in/it sorry :) [15:12:19] aye [15:13:23] 10Analytics, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install notebook[34] - https://phabricator.wikimedia.org/T183935#3917477 (10Ottomata) Status? [15:13:53] but yeah, elukey it seems to me like we shouldn't be hitting that buffer limit with our current throughput, if everything is operating normally [15:13:58] maybe i'm wrong but [15:14:05] let's say we do 500 msgs / sec [15:14:13] but we have 12 partitions and 12 processes [15:14:15] so really [15:14:20] 42 msgs / sec [15:14:26] (per process and buffer) [15:14:48] (oh sorry, default is 100000, not the other number) [15:14:57] AH [15:15:01] that is a more reasonable number [15:15:08] sorry i looked at the bytes buffer size limit [15:15:23] hmmm [15:15:26] well still [15:15:28] with 42/sec [15:15:47] it'd mean the process would need to be lagging by almost 40 minutes, right? [15:16:38] 100000 / 42 == how many seconds it would take if a producer was stalled to fill up the buffer == 2380.95 seconds == 39.68 minutes? [15:17:09] the calculation looks ok, I am wondering if we would get an alarm in this case [15:17:20] OHHHH [15:17:27] but the bytes per second gives a much lower time [15:17:38] about 350 KB / sec [15:17:41] in eventlogging-client-side [15:17:58] AND hm, each producer will double that! because most messages are produced twice [15:18:05] eventlogging-valid-mixed and eventlogging_ [15:18:29] 1048576/(700*1024/12) [15:18:50] $limit / (bytes per second / # processes) == 17.55 seconds [15:19:32] elukey: i could see that happening [15:19:35] it shouldn't happen [15:19:53] but if some broker has some problem (coordinator change, leader rebalance, dunno), mayyybe a produce request could be stalled for 18 seconds? [15:20:07] hmm, i dunno, that calculation assumes that all produce requests are stalled [15:20:20] which would mean all brokers are stalled? [15:20:26] which seems unlikely [15:22:38] elukey: i think this exception is good [15:22:38] but [15:22:47] if you really wanted to ensure delivery and retry if fail... [15:22:54] you could register a delivery callback with the produce request [15:23:01] and retry if failed [15:23:05] that would wait for the broker ACK [15:23:10] asynchronously [15:23:19] you'd still need this exception for the local enqueue failure though [15:23:27] so maybe that could be done later (if we really want to) [15:23:43] this is a very good point (still trying to digest all the calculations sorry :) [15:25:27] happy to batcave if you want .. :) [15:25:50] 10Analytics, 10Operations, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#3917495 (10faidon) a:05faidon>03RobH Sounds good, please go ahead :)
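[editor's note: a sketch of the delivery-callback idea ottomata outlines above — retrying asynchronously once the broker ACK (or error) comes back. The retry policy is an assumption; the chat only suggests the mechanism. Callbacks are only fired from poll()/flush(), and the BufferError handling is still needed for local enqueue failures.]

```lang=python
def on_delivery(err, msg):
    # called from poll()/flush() once the broker acks the message or
    # it errors out after librdkafka's own retries
    if err is not None:
        # naive async retry; a real version would cap the attempts
        kafka_producer.produce(msg.topic(), msg.value(), msg.key(),
                               on_delivery=on_delivery)

kafka_producer.produce(message_topic, message_value, message_key,
                       on_delivery=on_delivery)
kafka_producer.poll(0)  # serve any delivery callbacks that are ready
```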
[15:28:04] ottomata: sure! [15:28:08] let me grab a coffee first [15:29:30] k [15:34:25] ok I am in! [15:53:00] milimetric[m]: ^ (T185350).. not sure which of you three is the right one to ping ;) [15:53:00] T185350: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350 [15:54:11] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3917572 (10Tbayer) [16:01:01] ping joal [16:01:16] Ard joining [16:03:15] 10Analytics-EventLogging, 10Analytics-Kanban: Purge refined JSON data after 90 days - https://phabricator.wikimedia.org/T181064#3917591 (10Ottomata) [16:03:17] 10Analytics, 10Analytics-EventLogging: Lookout for duplicates in EL refine - https://phabricator.wikimedia.org/T185237#3917592 (10Ottomata) [16:03:39] 10Analytics, 10Analytics-EventLogging: Lookout for duplicates in EL refine, implement pluggable transform method config in JSONRefine - https://phabricator.wikimedia.org/T185237#3910470 (10Ottomata) [16:03:49] 10Analytics, 10Analytics-EventLogging: Lookout for duplicates in EL refine, implement pluggable transform method config in JSONRefine - https://phabricator.wikimedia.org/T185237#3910470 (10Ottomata) [16:04:04] 10Analytics-EventLogging, 10Analytics-Kanban: Lookout for duplicates in EL refine, implement pluggable transform method config in JSONRefine - https://phabricator.wikimedia.org/T185237#3910470 (10Ottomata) [16:04:53] 10Analytics-Kanban, 10Patch-For-Review: Wikiselector Perf issues on Chrome - https://phabricator.wikimedia.org/T185334#3917604 (10Nuria) a:03Nuria [16:05:06] 10Analytics-EventLogging, 10Analytics-Kanban: Purge refined JSON data after 90 days - https://phabricator.wikimedia.org/T181064#3778325 (10Ottomata) a:05Ottomata>03None [16:06:26] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Services (watching), 10User-Elukey: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3917606 (10Ottomata) a:05Ottomata>03None [16:07:36] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Services (watching), 10User-Elukey: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3917609 (10elukey) a:03elukey [16:11:09] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3917622 (10Nuria) >What is the argument for assuming that this data is correct? The data for the pdf file transfer looks as you would expect if... [16:25:51] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3917729 (10Tbayer) >>! In T185350#3917622, @Nuria wrote: >>What is the argument for assuming that this data is correct? > The data for the pdf f... [16:48:25] 10Analytics, 10Analytics-EventLogging: Archive and drop the MobileOptionsTracking EventLogging MySQL table - https://phabricator.wikimedia.org/T185339#3913393 (10Nuria) Table to be scooped and removed from mysql. [16:49:54] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2: New Pages split by editor type wrongly claims no anonymous users create pages - https://phabricator.wikimedia.org/T185342#3917771 (10Nuria) [16:50:55] 10Analytics: CamusPartitionChecker does not work when topic names have '.' or '-' in them. - https://phabricator.wikimedia.org/T171099#3917773 (10Nuria)
[16:51:33] 10Analytics, 10Analytics-Cluster: CamusPartitionChecker does not work when topic names have '.' or '-' in them. - https://phabricator.wikimedia.org/T171099#3454060 (10Nuria) [16:52:29] 10Analytics, 10Analytics-Cluster: CamusPartitionChecker does not work when topic names have '.' or '-' in them. - https://phabricator.wikimedia.org/T171099#3454060 (10Nuria) Kafka does not care but camus changes "." -> "_" Camus partition checker should do the same that camus does. [16:58:11] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405363 (https://phabricator.wikimedia.org/T185344) (owner: 10Joal) [16:59:13] (03CR) 10Fdans: [C: 032] Use WikimediaUI color palette in header [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/398459 (owner: 10Ladsgroup) [16:59:43] elukey: ok for me to deploy? [17:00:00] joal: ack! [17:00:04] thanks :) [17:01:45] hey fdans, one clarification - once QuickSurveyInitiation/QuickSurveysResponses' whitelist change gets merged the eventlogging cleaner will start nullifying fields as opposed to dropping, but it will not sanitize the previous data [17:01:56] this will need to be done manually [17:02:12] with an UPDATE blablabla [17:11:55] elukey: ohhh, so the purging script would not run once the whitelist changes? That's what I understood from nuria_ [17:12:14] (that is, run against old data) [17:12:29] !log Refinery deployed from scap [17:12:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:41] !log deploying refinery onto HDFS [17:12:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:12:54] fdans: it will not, it usually runs from the last committed timestamp, that is ~1 day before [17:13:08] sorry I was distracted and probably didn't hear that :( [17:13:13] I seee [17:13:15] fdans, elukey : let's see, how old is the oldest survey data? [17:13:35] but it shouldn't be a big deal to manually clean [17:13:44] elukey: i see, understood, my mistake. [17:13:47] elukey: right [17:14:13] nuria_: I should have paid attention during standup when Francisco was asking, I thought I did but probably not, my fault :( [17:14:37] it's my fault for never saying anything interesting elukey :D [17:15:27] fdans: you were talking about the eventlogging cleaner! It is a super awesome project :P [17:15:43] so I will ping you once leila approves of the change elukey (since I don't have permissions to alter the tables) [17:16:21] fdans: how old is the oldest survey data? [17:16:24] fdans: ack, I'll prepare the updates and run them.. it will be better to stop the eventlogging mysql consumer when we do it on the master just to be safe [17:16:40] fdans: some of it must have been sanitized already by prior passes of the script [17:17:46] nuria_: I was inspecting some data earlier from may 2017, let me see... [17:17:55] (which wasn't sanitized) [17:18:23] nuria_: so 20170515131236 for log.QuickSurveysResponses_15266417 [17:18:57] elukey: wouldn't that have been sanitized by prior runs of the script? [17:19:09] 20160217001022 for the other [17:19:13] fdans: or does whitelist whitelist every file? [17:19:59] *every field? [17:20:14] nuria_: the whitelist already whitelists those tables afaik, including the fields that we are removing [17:20:29] elukey: i see, ok
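[editor's note: the manual sanitization elukey refers to above ("an UPDATE blablabla") would look roughly like this — a sketch only. The table name is the one mentioned in the chat; the column list and cutoff timestamp are hypothetical, since the real ones come from the whitelist change under review.]

```lang=sql
-- run on the m4 master with the eventlogging mysql consumer stopped,
-- as elukey suggests; nullify fields the whitelist no longer keeps
UPDATE log.QuickSurveysResponses_15266417
   SET clientIp = NULL,   -- hypothetical column
       userAgent = NULL   -- hypothetical column
 WHERE timestamp < '20180123000000';
```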
[17:20:14] nuria_: the whitelist already whitelist those tables afaik, including the fields that we are removing [17:20:29] elukey: i see, ok [17:29:31] 10Analytics-Kanban: Launch top per country pageviews on UI - https://phabricator.wikimedia.org/T185510#3917834 (10fdans) [17:30:26] (03PS3) 10Fdans: Launch top pageviews by country [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405708 (https://phabricator.wikimedia.org/T185510) [17:36:01] !log Kill-Restart clickstream oozie job after deploy [17:36:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:42:27] elukey: no-problem deploy - Thanks for having allowed :) [17:42:38] elukey: or at least, no problem YET [17:42:50] exactly! :P [17:43:00] Ok gone for diner team, back after [19:29:30] (03CR) 10Joal: [C: 032] Clean refinery-job from BannerImpressionStream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405285 (owner: 10Joal) [19:37:47] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3918246 (10Nuria) >Also, note in both cases (response_size 18550 and 15722), these weren't partial requests either (the status code was 200, not... [19:42:02] (03Merged) 10jenkins-bot: Clean refinery-job from BannerImpressionStream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405285 (owner: 10Joal) [19:58:47] 10Analytics: Make Wikipedia clickstream dataset available as API - https://phabricator.wikimedia.org/T185526#3918334 (10mpopov) [19:58:50] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3918346 (10Tbayer) @ottomata notes that the response_size field should correspond to the "Size of response in bytes, excluding HTTP headers" out... [20:01:43] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3918360 (10Ottomata) Thanks @Tbayer, meant to write that here too :) I'd be very surprised if the value of `%b` from varnishlog which is sent a... [20:06:06] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3918374 (10Tbayer) In the meantime, I ran a query to estimate how much data was transferred in the download direction last month overall *if* th... [20:36:36] wow scala sometimes [20:36:48] abstract def [20:36:49] existentialAbstraction(tparams: http://www.scala-lang.org/api/2.10.5/scala/package.html#List%5B+A%5D=List%5BA%5D[http://www.scala-lang.org/api/2.10.5/scala/reflect/api/Universe.html#Symbol%3E:Null%3C:Symbols.this.SymbolApi], tpe0: http://www.scala-lang.org/api/2.10.5/scala/reflect/api/Universe.html#Type%3E:Null%3C:Types.this.TypeApi): http://www.scala-lang.org/api/2.10.5/scala/reflect/api/Universe.html#Type%3E:Null%3C:Ty [20:36:49] pes.this.TypeApi [20:36:49] A creator for existential types. [20:36:57] oops, didn't mean to paste URIs [20:37:01] abstract def [20:37:02] existentialAbstraction(tparams: List[Universe.Symbol], tpe0: Universe.Type): Universe.Type [20:37:02] A creator for existential types. 
[20:37:15] ottomata: existentialism is always somehow complex ;) [20:37:34] maybe scala is what god uses to program [20:38:06] ottomata: While I love comparing myself to god, I unfortunately think the opposite is not true :) [20:38:12] hahaha, i mean [20:38:24] if you can use a Universe to create an existential abstraction [20:38:26] sounds pretty powerful [20:38:32] agreed [20:38:41] ottomata: in scala myself, so feeling joker [20:38:57] ottomata: CR for move to spark2 should come soon [20:39:03] ottomata: Not tested however [20:39:16] i can get you a test command if you want to try joal [20:39:17] ottomata: with small refactors along the way [20:39:25] ottomata: Would be great :) [20:45:20] ERROR | wrapper | 2018/01/21 05:01:51 | JVM appears hung: Timed out waiting for signal from JVM. [20:45:21] oops [21:06:31] (03PS1) 10Joal: Refactor json-refine and json-to-Druid for spark2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 [21:11:20] (03CR) 10jerkins-bot: [V: 04-1] Refactor json-refine and json-to-Druid for spark2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 (owner: 10Joal) [21:21:24] (03PS1) 10Ottomata: [WIP] Add configurable transform function to JSONRefine [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405800 (https://phabricator.wikimedia.org/T185237) [21:21:48] joal: ^ :) [21:21:57] time for nastiness [21:22:34] still gotta fix CLI opts [21:23:09] (03PS2) 10Ottomata: [WIP] Add configurable transform function to JSONRefine [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405800 (https://phabricator.wikimedia.org/T185237) [21:50:00] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3913807 (10faidon) >>! In T185350#3918374, @Tbayer wrote: > In the meantime, I ran a query to estimate how much data was transferred in the down... [22:01:59] (03PS2) 10Joal: Refactor json-refine and json-to-Druid for spark2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 [22:05:35] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3918834 (10Tbayer) OK, I just launched the below query for ulsfo - will report the result here once it has completed. ```lang=sql SELECT SUM(... [22:05:38] (03PS3) 10Joal: Refactor json-refine and json-to-Druid for spark2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 [22:05:43] 10Analytics, 10Analytics-Wikistats, 10Accessibility: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#3918835 (10Volker_E) [22:06:49] mwarf ottomata - looks like archiva is gone again :( [22:09:28] (03CR) 10jerkins-bot: [V: 04-1] Refactor json-refine and json-to-Druid for spark2 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/405770 (owner: 10Joal) [22:10:23] 10Analytics, 10Analytics-Wikistats, 10Accessibility: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#3918855 (10Volker_E) https://gerrit.wikimedia.org/r/#/c/398459/ has aimed at taming a few color issues. IMO making use of [[ https://www.npmjs.com/package/wikimedi... [22:21:41] yea....
[22:21:52] again [22:21:53] INFO | jvm 1 | 2018/01/22 22:00:31 | [INFO] Failed to parse Maven artifact /var/lib/archiva/repositories/mirrored/org/apache/maven/maven-plugin-api/3.0.2/maven-plugin-api-3.0.2.jar due to error in opening zip file [22:21:53] hm [22:22:12] ottomata: I did something wrong maybe? [22:23:41] i dunno joal [22:23:42] not sure [22:23:43] don't think so [22:23:46] try again now [22:24:09] Failed to execute goal on project refinery-job-spark-2.1: Could not resolve dependencies for project org.wikimedia.analytics.refinery.job:refinery-job-spark-2.1:jar:0.0.58-SNAPSHOT: Failure to find org.scalamock:scalamock-scalatest-support_2.11:jar:3.6.0 in https://archiva.wikimedia.org/repository/mirrored/ was cached in the local repository, resolution will not be reattempted until the update [22:24:15] interval of system-wide-wmf-mirrored-default has elapsed or updates are forced [22:24:32] hmmm, i didn't get any request this time [22:24:41] oh [22:24:46] from stat1004 [22:24:56] yeah, but i think your maven there has cached the response [22:24:59] from when it was broken [22:25:08] :( [22:25:09] remove the offender from .m2? [22:25:12] your ~/.m2 [22:25:18] k [22:26:22] hm [22:26:22] GET /repository/releases/org/scalamock/scalamock-scalatest-support_2.11/3.6.0/scalamock-scalatest-support_2.11-3.6.0.jar HTTP/1.0" 404 [22:26:34] indeed 1 [22:26:38] Was about to paste [22:26:45] :( [22:27:01] joal it worked when i bounced archiva before? [22:27:05] i'm going to do that again and watch logs [22:27:21] try again joal [22:27:33] I think it did ottomata - But can't say really since I stopped using archiva for a minute, therefore had a cached jar [22:27:59] same [22:28:15] hm still 404 [22:28:32] but not much info as to why.. [22:29:25] i'm going to recreate the proxy connecto... [22:30:14] joal try again for me [22:30:24] Started [22:30:36] hm still 404 [22:30:38] ooook... [22:30:42] going to recreate the remote repo [22:30:59] ottomata: do you wish me to change the version or something like that, to test? [22:31:40] sure? [22:32:34] Trying with version 3.5 [22:32:46] same old ottomata - sorry [22:37:22] yeah, i think something is wrong here...maybe archiva can't contact maven central anymore [22:37:29] strange that it can for cloudera though... [22:43:28] joal i dunno what's going on right now [22:43:41] but i can see that archiva is getting 500s when it tries to contact central [22:43:55] ottomata: and now, it WOOOOOORKS ... [22:43:59] ottomata: :S [22:44:18] wow and then [22:44:19] ERROR | wrapper | 2018/01/22 22:44:10 | JVM appears hung: Timed out waiting for signal from JVM. [22:44:19] ERROR | wrapper | 2018/01/22 22:44:10 | JVM did not exit on request, terminated [22:44:20] INFO | wrapper | 2018/01/22 22:44:10 | JVM exited on its own while waiting to kill the application. [22:44:20] STATUS | wrapper | 2018/01/22 22:44:11 | JVM exited in response to signal SIGKILL (9). [22:44:22] wait what? [22:44:23] now it works?
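[editor's note: the "was cached in the local repository, resolution will not be reattempted" error above is maven remembering the earlier 404 until the repository's update interval expires; the usual way out — assuming the standard ~/.m2 layout ottomata points at — is to delete the cached entry for the offending artifact and force a re-check of the remotes:]

```lang=bash
# drop the cached resolution failure for the artifact that 404'd
rm -rf ~/.m2/repository/org/scalamock
# -U forces maven to re-check remote repositories instead of honoring
# the cached "resolution will not be reattempted" marker
mvn -U clean package
```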
[22:44:41] i did change the URI [22:44:42] to https://repo1.maven.org/maven2/ [22:44:57] ottomata: it worked better (downloaded some stuff - but still failed later) [22:45:04] hmm [22:45:08] Return code is: 502 , ReasonPhrase:Bad Gateway [22:45:11] wow [22:45:13] /repository/mirrored/org/scalamock/scalamock-scalatest-support_2.11/3.6.0/scalamock-scalatest-support_2.11-3.6.0.pom HTTP/1.0" 200 [22:45:22] archiva died [22:45:25] pom worked, jar didn't [22:45:25] try again joal [22:45:42] Maaaaan - I'm super sorry for that mess [22:46:17] hmm, i don't think it is your fault [22:46:18] hmmm [22:46:26] i can see the jvm hanging on a futex.. [22:46:45] ottomata: no download problem anymore :) [22:46:48] ottomata: that one is fized [22:46:50] fixed [22:46:54] it downloaded? [22:47:04] ottomata: However I'm back with something else funny: org/scalamock/clazz/MockImpl$ : Unsupported major.minor version 52.0 [22:47:08] uh oh [22:47:15] but not a maven problem? [22:47:16] :) [22:47:19] Worked at home since I use j8 [22:47:22] correct [22:47:25] joal, you can use J8 [22:47:33] you are on 1004? [22:47:36] set java home to [22:47:38] yes [22:47:52] export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/ [22:47:54] probably? [22:48:00] hopefully maven will pick that up [22:48:45] Unsupported major.minor version 52.0 [22:48:49] MAAAAHHHH ! [22:48:59] * joal is angry at java [22:51:07] back to maven download issues [22:51:15] hm - I think I'm gonna stop that for tonight [22:52:55] ok [22:53:01] joal dunno, let's work on it tomorrow [22:53:06] sure [22:53:06] archiva/maven should work [22:53:12] Thanks again [22:58:14] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3919049 (10Tbayer) @faidon: The ulsfo result for December is 2018 (decimal) terabytes. Plausible? ``` total_bytes requests 2017852005519238... [23:06:45] 10Analytics-Data-Quality, 10Operations, 10Traffic: Vet reliability of the response_size field for data analysis purposes - https://phabricator.wikimedia.org/T185350#3919078 (10faidon) Interesting! So with a ratio in:out of approximately 25:1 (based on January's figures), this means that we could estimate the... [23:11:57] ottomata: actually managed to have it working :) [23:12:28] one thing fails: creating a table with name: `db.table` and not `db`.`table` [23:12:47] Apart from that, looks like it worked :) [23:18:29] ottomata: This error I get is actually fairly interesting, since it seems to affect both spark1 and 2 - Let's talk about that tomorrow :) [23:18:33] Have a good night a-team [23:35:30] 10Analytics, 10Code-Stewardship-Reviews, 10Operations, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319#3919182 (10greg) p:05Triage>03Normal [23:41:31] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jan-Mar-2018): Explain decrease in number of patchset authors for same time span when accessed 3 months later - https://phabricator.wikimedia.org/T184427#3919195 (10Aklapper) p:05Normal>03Low