[06:04:57] <elukey>	 joal: o/
[06:05:00] <elukey>	 morninggggg
[06:09:31] <elukey>	 if you are ok I'd proceed with https://gerrit.wikimedia.org/r/#/c/429429/ (applying CMS to the namenodes)
[07:21:30] <joal>	 Hi elukey - Let's go for it!
[07:23:01] <elukey>	 \o/
[07:23:06] <elukey>	 just tested it in labs, all good
[07:23:52] <elukey>	 so plan is: merge, restart namenode on an1002, check metrics for a couple of hours, do the same with an1001
[07:30:36] <elukey>	 ok merged, now restarting namenode on an1002
[07:33:41] <elukey>	 restarted
[07:33:51] <joal>	 watching
[07:35:36] <joal>	 elukey: we haz new metrics :)
[07:37:09] <joal>	 elukey: tis patch doesn't bump the heap, right?
[07:38:29] <joal>	 elukey: sonething weird as well: number of corrupted blocks
[07:39:06] <elukey>	 joal: nope I added only the new settings as suggested by cloudera 
[07:39:16] <elukey>	 the corrupted block thing might be a weirdness in jmx
[07:39:28] <joal>	 k
[07:40:03] <elukey>	 I need to check if the metric is a total counter or not, but I suspect that sometimes it is not updated 
[07:40:28] <elukey>	 because I rememeber a lot of times in which me and you did the check on hdfs via cli and nothing popped up
[08:07:12] <joal>	 elukey: new GC time is not looking great, he?
[08:07:37] <elukey>	 I was about to write someting :)
[08:07:41] <joal>	 :)
[08:07:57] <joal>	 elukey: gave some time to stabilize, but man, that's not what we would expected I assume
[08:08:34] <elukey>	 so GC Time seems to be more constant, around 6/7s for old gen, that is good afaics (I am monitoring also the JvmPauseMonitor occurrences)
[08:08:49] <elukey>	 what we'd need to avoid is those spikes to 25s that kills us
[08:08:51] <joal>	 elukey: Hahhh
[08:09:05] <joal>	 elukey: you could reimage a couple of workers :)
[08:09:09] <elukey>	 ahhahah
[08:09:23] <elukey>	 GC runs are definitely more
[08:09:45] <elukey>	 but given the nature of CMS it might be a consequence of how it behaves
[08:10:22] <joal>	 elukey: other interesting thing I have noticed: https://grafana-admin.wikimedia.org/dashboard/file/server-board.json?panelId=7&fullscreen&orgId=1&var-server=analytics1002&var-network=eth0&from=now-1h&to=now&refresh=1m
[08:10:54] <elukey>	 yes that one was expected, CMS grabs some threads to run its logic
[08:11:17] <joal>	 elukey: from almost nothing to ~8% CPU usage, that's a lot !
[08:13:20] <elukey>	 joal: yeah but I'd say that we can afford to sustain 8% of cpu utilization :D
[08:13:28] <joal>	 agreed :)
[08:14:08] <elukey>	 one thing that we could do is to force a failover to analytics1002
[08:14:15] <elukey>	 and see how timings change
[08:14:55] <joal>	 elukey: +1
[08:15:29] <elukey>	 all right doing it
[08:17:12] <elukey>	 joal: done!
[08:17:57] <elukey>	 ah while we wait for the metrics
[08:18:22] <elukey>	 do you remember that a while ago we were wondering how frequent the namenode takes snapshots of its metadata?
[08:18:29] <elukey>	 I think that we have the default, 1h
[08:18:53] <joal>	 elukey: Interesting!
[08:19:12] <joal>	 elukey: You confirm the passive namenode does that, not the active one, right?
[08:19:24] <elukey>	 yep it seems so
[08:19:24] <elukey>	 -rw-r--r-- 1 hdfs hdfs 1806544161 Apr 30 06:28 fsimage_0000000001894266265
[08:19:28] <elukey>	 -rw-r--r-- 1 hdfs hdfs         62 Apr 30 06:28 fsimage_0000000001894266265.md5
[08:19:31] <elukey>	 -rw-r--r-- 1 hdfs hdfs 1807687684 Apr 30 07:28 fsimage_0000000001894538880
[08:19:34] <elukey>	 -rw-r--r-- 1 hdfs hdfs         62 Apr 30 07:28 fsimage_0000000001894538880.md5
[08:19:45] <elukey>	 and we keep the last two snapshots afaics
[08:20:04] <joal>	 Interesting!
[08:20:09] <joal>	 Thanks for letting me know:)
[08:20:23] <elukey>	 thanks for letting me know how it worked! :D
[08:21:04] <joal>	 elukey: we are like physicians - I study theory and models, you work ground battles :)
[08:22:18] <elukey>	 ahahah
[08:25:35] <elukey>	 joal: I am watching https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=26&fullscreen&orgId=1&from=now-1h&to=now, selecting only an1001/2 old gen metrics
[08:26:10] <joal>	 I was concentraing on this as well
[08:26:16] <joal>	 The pattern is very interesting
[08:26:44] <elukey>	 and also no org.apache.hadoop.util.JvmPauseMonitor occurrence during the past hour
[08:26:55] <elukey>	 (there was one every 10 mins more or less)
[08:27:12] <joal>	 THAT is super awesome :)
[08:27:23] <elukey>	 ( grep org.apache.hadoop.util.JvmPauseMonitor  /var/log/hadoop-hdfs/hadoop-hdfs-namenode-analytics1002.log)
[08:27:35] <joal>	 elukey: look that drop in old GC time for an1001
[08:27:57] <joal>	 I think those high time are due to first-snashot creation
[08:28:26] <elukey>	 might be yes
[08:29:06] <elukey>	 I might be too early but I am happy about CMS
[08:29:16] <joal>	 Another interesting finding: only diff between active/passive modes is disk usage :)
[08:29:25] <joal>	 elukey: goooood :)
[08:29:35] * joal is happy when elukey is happy :)
[08:30:06] <joal>	 https://www.youtube.com/watch?v=d-diB65scQU
[08:30:55] * elukey sings with Joseph
[08:37:29] <elukey>	 joal: I'd say we restart namenode on an1001, wait a bit and then failover again
[08:37:38] <joal>	 works for me elukey :)
[08:37:47] <elukey>	 ack!
[08:41:09] <elukey>	 restarted
[08:43:24] <joal>	 elukey: the pattern match between moving an1002 and an1001 to new GCs is kinda freaky
[08:44:51] <elukey>	 joal: what do you mean?
[08:45:05] <joal>	 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=26&fullscreen&orgId=1&from=1525073612898&to=1525077895920
[08:45:15] <joal>	 elukey: beginning and end of chart
[08:46:14] <elukey>	 ah yes the restart is always a bit intensive on the namenode
[08:47:01] <elukey>	 joal: take also a look to
[08:47:01] <elukey>	 https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=58&fullscreen&orgId=1&from=now-3h&to=now
[08:47:31] <joal>	 elukey: nono - I mean, the new GC times between an1002 and an1001 - They are EXACTLY the same !!
[08:48:00] <elukey>	 ah yes! 
[08:48:03] <elukey>	 aahhahah
[08:48:25] <joal>	 Not true for newGen, but for old-gen, man, that is almost scary !!
[08:48:26] <joal>	 :D
[08:50:48] <wikibugs_>	 (03CR) 10Joal: "Done nuria" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380 (owner: 10Joal)
[08:55:48] <elukey>	 it is weird that an1001's GC timings have not yet recovered
[08:56:24] <joal>	 hm
[08:58:04] <elukey>	 this is the problem of having too many metrics
[08:58:05] <elukey>	 ahahhah
[09:00:06] <joal>	 elukey: Would you mind telling when was the previous snapshot from an1001?>
[09:02:40] <elukey>	 root@analytics1001:/home/elukey# ls -lht /var/lib/hadoop/name/current/fs*
[09:02:43] <elukey>	 -rw-r--r-- 1 hdfs hdfs   62 Apr 30 07:29 /var/lib/hadoop/name/current/fsimage_0000000001894538880.md5
[09:02:46] <elukey>	 -rw-r--r-- 1 hdfs hdfs 1.7G Apr 30 07:29 /var/lib/hadoop/name/current/fsimage_0000000001894538880
[09:02:49] <elukey>	 -rw-r--r-- 1 hdfs hdfs   62 Apr 30 06:29 /var/lib/hadoop/name/current/fsimage_0000000001894266265.md5
[09:02:52] <elukey>	 -rw-r--r-- 1 hdfs hdfs 1.7G Apr 30 06:28 /var/lib/hadoop/name/current/fsimage_0000000001894266265
[09:02:55] <elukey>	 its 9:02 UTC now
[09:03:26] <joal>	 aiement Par Carte 
[09:03:32] <joal>	 wow - sorry about that
[09:03:35] <elukey>	 hahahaahh
[09:03:37] <joal>	 hm
[09:05:25] <elukey>	 from the logs it seems doing its work
[09:06:12] <joal>	 well, let's leave it do then :)
[09:09:23] <elukey>	 it would be nice to know though what the hell it is doing, that horrible GC timing is not good
[09:12:51] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Operations, and 2 others: Kafka API negotiation errors on kafka main brokers - https://phabricator.wikimedia.org/T193238#4167452 (10mobrovac)
[09:19:37] <elukey>	 the only thing that I keep seeing is
[09:19:37] <elukey>	 2018-04-30 09:19:23,431 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
[09:19:43] <elukey>	 for various users
[09:23:38] <elukey>	 (that should be ok)
[09:23:44] <elukey>	 trying jconsole
[09:27:27] <elukey>	 so the oldgen seems 4.2G and completely full
[09:27:43] <elukey>	 so it might be in a GC loop since it keeps trying to clean up memory
[09:28:21] <joal>	 elukey: the patterns in GC time is as if it was trying again and again to do the same thing
[09:31:02] <elukey>	 so from jconsole it is telling me that the GC time is 10s
[09:31:16] <elukey>	 possibly the graph is not plotting the right value
[09:31:18] <elukey>	 increase(jvm_gc_collection_seconds_sum{instance=~"analytics100[12]:10080"}[5m])
[09:31:31] <elukey>	 I now remember that I had to tune this
[09:32:21] <elukey>	 so basically, since we get a counter of total GC time spent, I added a measure about delta for the lowest time window, 5m
[09:32:48] <elukey>	 in any case, Old gen is full
[09:32:53] <elukey>	 lemme check how it is for an1002
[09:33:34] <elukey>	 we could instead plot the rate per second of gc time
[09:33:40] <elukey>	 that might be more accurate
[09:34:08] <elukey>	 old gen heap full even with an1002
[09:34:10] <elukey>	 mmmm
[09:34:35] <joal>	 rate for jvm_gc_collection_seconds_sum gives me 300ms :(
[09:35:12] <elukey>	 yeah that will be time spent per second in GC
[09:38:33] <elukey>	 now I am wondering i f XX:CMSInitiatingOccupancyFraction=70 is not ok for our use case
[09:38:58] <joal>	 elukey: and/or - Maybe 6Gb for a NN is too small?
[09:39:49] <elukey>	 might be.. so I tried to restart the namenode on 1001 again to see if it improves, if not I'd proceed with removing CMSInitiatingOccupancyFraction
[09:40:00] <joal>	 k
[09:43:31] <elukey>	 joal: another thing would be that CMSInitiatingOccupancyFraction=70 + a heap that is not super big (like you were saying) might trigger these weird things
[09:44:09] <joal>	 elukey: That's my idea as well (small heap + try to keep empty = really small portion of space to actually work!)
[09:44:28] <elukey>	 yeah 
[09:44:41] <joal>	 elukey: now, why did it work with an1002????
[09:45:13] <elukey>	 it might be just luck, with an1001 we might have crossed the threshold that triggers the continous GC 
[09:45:25] <elukey>	 (speculations)
[09:47:33] <elukey>	 so now timings are good
[09:47:47] <joal>	 Man - That's weird
[09:48:13] <elukey>	 I am pretty sure it is that CMSInitiatingOccupancyFraction=70
[09:48:37] <joal>	 elukey: I'd give it more RAM instead of removing tat param, don't you htink?
[09:49:07] <elukey>	 joal: I was debating with myself that same thing
[09:49:09] <elukey>	 8g ?
[09:49:10] <joal>	 :D
[09:49:22] <joal>	 yeah, let's start with that
[09:50:30] <elukey>	 done
[09:54:34] <joal>	 elukey: I like that pattern of GC counts now :)
[09:55:07] <joal>	 elukey: let's wait before assuming it's good, but at least the first glimpse of it makes me fel good
[09:55:27] <elukey>	 joal: we were happy before this morning :P
[09:55:39] <joal>	 elukey: I changed on purpose :)
[09:59:25] <joal>	 elukey: would you mind changing the heap size for an1002, swapping active and restart?
[09:59:32] <elukey>	 joal: what do you think aobut doing a failover?
[09:59:34] <elukey>	 ahahhahah
[09:59:41] <elukey>	 yes sure :)
[09:59:43] <joal>	 I'm eager to see those GC counts reduce :)
[10:00:01] <elukey>	 I have already puppetized the GC increase
[10:00:06] <joal>	 awesome
[10:00:27] <elukey>	 an1001 is again the active NN
[10:03:14] <joal>	 elukey: an1001 out of ongoing GC loops I hink :)
[10:03:29] <joal>	 elukey: CPU drop :) Hoorayn
[10:04:58] <elukey>	 \o/
[10:05:00] <joal>	 elukey: on becoming active, a bit more of NewGen GC, but no bump in olgGen - Gooood
[10:05:03] <elukey>	 going to restart 1002 in a bit
[10:05:32] <elukey>	 so another learning today is that NN is really sensitive when we talk about heap space
[10:05:40] <joal>	 elukey: yes, let's releawse this poor CPU of trying to empty a room too small for the thing tio put in it :)
[10:06:10] <joal>	 elukey: not sure I'd put it the way you do - I'd say: NN needs RAM, let's make sure we give ti some
[10:06:52] <elukey>	 joal: yeah, but it worries me a bit to have a dependency between number of files stored and heap space :D
[10:07:00] <joal>	 elukey: why?
[10:07:28] <joal>	 elukey: The purpose of NN being to manage those files, i actaully makes sense doesn't it?
[10:07:50] <elukey>	 because there is no free lunch in my opinion, when the heap grows it also means that other things like GC etc.. might have more challenges in doing their work under pressure
[10:08:01] <elukey>	 it makes sense indeed, it worries me as ops :D
[10:08:04] <joal>	 agreed
[10:08:41] <joal>	 If you want free lunch, please feel free to come home anytime - But it's surely not hadoop-realted ;)
[10:10:13] <joal>	 elukey: an1002 out of CPU4GC loop :)
[10:10:29] <joal>	 elukey: Thanks mate for fine tuning!h
[10:10:43] <elukey>	 \o/
[10:14:58] <joal>	 elukey: now I'll be curious for reall about what happens when you reimage hosts :)
[10:17:36] <elukey>	 will do it this afternoon!
[10:17:56] <elukey>	 I'd say that we could postpone druid's migration to wednesday, to avoid too many inflight things
[10:18:08] <joal>	 elukey: as you wish :)
[10:20:18] <joal>	 elukey: it's also interesting to notice that, before we started to work on NN, heap was 6G and the thing was working, with used-heap ~ max-heap
[10:20:50] <joal>	 elukey: The GC change says: Let's make sure we try to have heap no more than 70% full
[10:22:06] <joal>	 elukey: What makes he thing happy is obviously to move max-heap to 6G (previsouly used heap) / 0.7 (70 percent) ~ 8.5G :)
[10:24:48] <elukey>	 :)
[10:37:40] <elukey>	 joal: going to lunch + errand, I think that everything is stable now
[10:37:53] <elukey>	 thanks a lot!
[10:38:05] <elukey>	 I think the end result is a big win for us :)
[10:38:18] <joal>	 elukey: thank YOU, as usual ;)
[10:50:18] <wikibugs_>	 10Analytics-Kanban, 10Analytics-Wikistats: Add druid datasources as configuration parameter in AQS - https://phabricator.wikimedia.org/T193387#4167623 (10JAllemandou)
[10:50:21] <wikibugs_>	 (03PS1) 10Joal: Add a config param for druid datasources [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387)
[11:11:54] <wikibugs_>	 10Analytics-Kanban, 10Analytics-Wikistats: Index by-snapshot mediawiki-hsitory-reduced in druid - https://phabricator.wikimedia.org/T193388#4167661 (10JAllemandou)
[11:12:40] <wikibugs_>	 (03PS3) 10Joal: Add optional datasource to druid loading workflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053
[11:12:42] <wikibugs_>	 (03PS1) 10Joal: Add snapshot to datasource-name (mw hist reduced) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429770 (https://phabricator.wikimedia.org/T193388)
[12:20:51] <wikibugs_>	 (03PS1) 10QChris: Add .gitreview [analytics/wmde/WDCM-Biases-Dashboard] - 10https://gerrit.wikimedia.org/r/429782
[12:20:54] <wikibugs_>	 (03CR) 10QChris: [V: 031 C: 032] Add .gitreview [analytics/wmde/WDCM-Biases-Dashboard] - 10https://gerrit.wikimedia.org/r/429782 (owner: 10QChris)
[12:27:03] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Update ua-parser package. Both uap-java and uap-core - https://phabricator.wikimedia.org/T192464#4167805 (10fdans) a:05fdans>03Nuria
[12:31:59] * joal loves the effect of the new heap/gc conf on namenodes :)
[12:58:03] <elukey>	 it looks indeed really stable and efficient, gooooooood
[13:03:21] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257#4167828 (10elukey) CMS settings alone were not sufficient to make everything work, when we restarted the namenode daemon on analytics1001 (that was the standby at the t...
[13:05:43] <wikibugs_>	 10Analytics: Varnishkafka does not play well with varnish 5.2 - https://phabricator.wikimedia.org/T177647#4167830 (10elukey) Thanks a lot for all the research work, I am following your progress and I'll try to dedicate some time in reviewing the code during this month!
[13:07:54] <elukey>	 joal: whenever you are back online, if you feel adventurous we can try to upgrade druid analytics :)
[13:08:21] <elukey>	 in the meantime, I am going to reimage two hadoop workers to see how it goes with the new GC settings
[13:09:07] <elukey>	 and the winners are.. 1050 and 1049 :)
[13:11:09] <ottomata>	 hoi
[13:11:34] <elukey>	 ottomata: o/
[13:11:38] <elukey>	 watch 'ps aux | grep yarn | grep -c -v grep' is really awesome
[13:11:39] <elukey>	 :)
[13:14:03] <ottomata>	 :)
[13:14:39] <elukey>	 so it seems that we found a good GC setting for the namenode
[13:15:24] <elukey>	 but to make it work properly we had to go to Xmx/Xms 8G 
[13:15:48] <elukey>	 heap consumption stayed around 6G as it was before, so I think we'll be fine for a while
[13:18:38] <ottomata>	 ok great
[13:18:39] <ottomata>	 sounds good
[13:18:47] <ottomata>	 we can probably go more than 8G now if you think that would be wise
[13:21:29] <elukey>	 nono for the moment I think we are fine, but let's keep an eye on those GC metrics in the future
[13:21:45] <ottomata>	 k
[13:21:54] <ottomata>	 (fyi am beginning kafka200[23] reimage)
[13:22:00] <elukey>	 nice!
[13:22:23] <elukey>	 did you see https://phabricator.wikimedia.org/T193238 ?
[13:22:46] <elukey>	 now statsv has 0.9 hardcoded basically
[13:23:17] <wikibugs_>	 10Analytics: Varnishkafka does not play well with varnish 5.2 - https://phabricator.wikimedia.org/T177647#4167903 (10R4q3NWnUx2CEhVyr) Thanks, however we are hitting an issue where with varnish 5.2 we are getting Segfaults... We tried two paths: 1. porting the changes in VSL/VSM APIs 2. changing to VUT  both get...
[13:23:52] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4167904 (10Ottomata)
[13:24:06] <ottomata>	 elukey:  i did
[13:24:16] <ottomata>	 sounds great, i'll add it to list of things to change for main upgrade
[13:24:23] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4167905 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1049.eqiad.wmnet', 'an...
[13:24:24] <elukey>	 ack
[13:25:43] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Services (watching): Upgrade Kafka on main cluster with security features - https://phabricator.wikimedia.org/T167039#4167912 (10Ottomata)
[13:26:33] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4167916 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka...
[13:29:49] <fdans>	 ottomata: helloooo what do we do with EL's tests?
[13:30:09] <fdans>	 they are really _testing_ my patience :P
[13:31:04] <ottomata>	 hmm, i think i saw someone submit something..
[13:31:17] <ottomata>	 fdans:  https://gerrit.wikimedia.org/r/#/c/429651/
[13:31:27] <ottomata>	 for your patch we can skip jenkins
[13:31:30] <ottomata>	 but we should get it fixed
[13:31:32] <ottomata>	 maybe that patch will do it?
[13:31:36] <ottomata>	 not sure, it got a -1 too
[13:32:36] <wikibugs_>	 (03CR) 10Milimetric: [V: 032 C: 032] "nice" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/429380 (owner: 10Joal)
[13:34:32] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4167936 (10Ottomata)
[13:35:50] <fdans>	 ottomata: my patch also included that fix, now there's a different error when importing the test suite in setup.py
[13:36:00] <ottomata>	 oh yours does the same?
[13:48:15] <joal>	 elukey: Heya
[13:48:36] <elukey>	 o/
[13:48:37] <wikibugs_>	 10Analytics, 10Analytics-Wikistats: Present a page view metric description to the user that they are likely to understand - https://phabricator.wikimedia.org/T182109#4167967 (10Milimetric)
[13:49:11] <joal>	 elukey: How has it been going on the reimage side?
[13:50:07] <joal>	 elukey: guessing from https://grafana-admin.wikimedia.org/dashboard/db/analytics-hadoop?panelId=26&fullscreen&orgId=1&from=now-1h&to=now
[13:50:11] <elukey>	 joal: about 30m to be completed
[13:50:30] <joal>	 elukey: if you want, I'm happy for druid later
[13:51:03] <elukey>	 joal: ack! Maybe in ~30m?
[13:51:15] <joal>	 +1 elukey
[13:58:36] <elukey>	 joal: I am thinking a bit better about timings: if we don't care about Banner Impressions, we can stop it and then do the upgrade in a short timeframe, otherwise we'd need to wait one hour to allow each middlemanager to "Drain" from real time tasks
[13:58:54] <elukey>	 and there are a lot of meetings :)
[14:01:06] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4167988 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka2002.codfw.wmnet'] ```  and were **ALL** succ...
[14:02:42] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4167990 (10Ottomata)
[14:02:57] <icinga-wm>	 PROBLEM - Hadoop DataNode on analytics1050 is CRITICAL: NRPE: Command check_hadoop-hdfs-datanode not defined
[14:03:18] <elukey>	 known, reimainging --^
[14:03:58] <elukey>	 added some downtime
[14:07:56] <ottomata>	 fdans:  sorry, one more thing about https://gerrit.wikimedia.org/r/#/c/428390/7/modules/geoip/files/archive.sh  why the "./" in cp -rl "$MAXMIND_DB_SOURCE_DIR/." ?
[14:09:10] <fdans>	 ottomata: hm, I thought it was necessary, but I guess not with -rl
[14:09:17] <wikibugs_>	 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4168012 (10Milimetric) I'm all for renaming to Geoeditors, but at this point it does involve a bunch of busy work, so I want to make sure everyone agrees, including Asaf.  > Talking about over-nesting, there is also https://wik...
[14:09:56] <joal>	 elukey: no prob for stopping job
[14:10:33] <ottomata>	 fdans not sure what it would do?
[14:10:43] <ottomata>	 the .
[14:11:10] <ottomata>	 (if it were rsync) i would just end both args in /
[14:11:12] <ottomata>	 since it doesnt' hurt
[14:11:23] <ottomata>	 and generally means to copy the content, but will also create $CURRENT_DIR if it doesnt' exist
[14:11:49] <ottomata>	 not sure if it has a different meaning without / with cp, but it does with rsync
[14:11:50] <ottomata>	 so i just use it
[14:11:59] <ottomata>	 i'd end both args with '/', but i dont' know what the 
[14:12:02] <ottomata>	 '.' would do
[14:13:19] <fdans>	 ottomata: I was copying all the contents of the directory with cp -R /path/ro/dir/. /path/to/newdir
[14:13:26] <fdans>	 but I think that's OSX's cp
[14:13:45] <ottomata>	 do explicitly do content only, i'd seen '/*', but not '/.'
[14:14:00] <ottomata>	 . afaik always refers to current dir
[14:14:04] <ottomata>	 (could be wrong though)
[14:14:35] <fdans>	 ottomata: soooo i remove the dot?
[14:15:02] <ottomata>	 fdans: 
[14:15:02] <ottomata>	 https://askubuntu.com/questions/86822/how-can-i-copy-the-contents-of-a-folder-to-another-folder-in-a-different-directo
[14:15:07] <ottomata>	 i guess i just don't know about it!
[14:15:40] <ottomata>	 let's leave it!
[14:16:10] <ottomata>	 fdans:  you've tested this script as is (with different args passed in on CLI) on stat1005 or somewhere?
[14:17:56] <wikibugs_>	 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4168038 (10Milimetric) > * It's worth documenting how distinct anonymous edits are counted. I assumed it was (user agent, IP) pairs, but it looks like it's just by IP.  Done  > * I suggest renaming whole dataset to something li...
[14:18:14] <elukey>	 joal: I've disabled notifications for the banner impression alert, if you want we can stop banner impression
[14:18:23] <elukey>	 I guess disable the cron on an1003 and kill the job?
[14:18:24] <joal>	 elukey: moving forward now?
[14:18:30] <elukey>	 yeah let's do it
[14:18:33] <joal>	 correct elukey 
[14:18:36] <joal>	 ok
[14:18:48] <joal>	 elukey: I let you the cron, I do yarn
[14:19:01] <elukey>	 ack
[14:19:11] <wikibugs_>	 10Analytics, 10Analytics-Dashiki, 10Analytics-Kanban: Add pivot parameter to tabular layout graphs - https://phabricator.wikimedia.org/T126279#4168043 (10Milimetric)
[14:19:14] <joal>	 elukey: I'm assuming it also needs a puppet manual steop, no?
[14:19:20] <joal>	 To make sure cron stays disabled
[14:20:15] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168050 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka...
[14:20:37] <elukey>	 joal: I disabled puppet on the host
[14:20:42] <joal>	 k
[14:21:19] <joal>	 !log Kill BannerImpressionStream job before upgrading druid
[14:21:20] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:22:02] <joal>	 elukey: I also need to pause druid hourly loading jobs
[14:22:05] <joal>	 elukey: doing so
[14:22:57] <elukey>	 super
[14:23:33] <elukey>	 !log disabled cron/check on analytics1003 to respawn banner impressions if needed
[14:23:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:23:37] <joal>	 !log Suspend webrequest-druid-hourly-coord and pageview-druid-hourly-coord before druid upgrade
[14:23:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:24:04] <elukey>	 joal: in the meantime, two new workers again in service
[14:24:33] <elukey>	 gc timings for the namenode are awesome :)
[14:24:42] <joal>	 elukey: checking metrics in NN while they get back in - looks super great indeed
[14:25:10] <icinga-wm>	 RECOVERY - Hadoop DataNode on analytics1050 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[14:25:56] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4168060 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1049.eqiad.wmnet', 'analytics1050.eqiad.wmnet'] ```  and were **ALL** su...
[14:29:59] <elukey>	 joal: ready to upgrade druid1001
[14:30:05] <elukey>	 I'll follow http://druid.io/docs/0.10.0/operations/rolling-updates.html
[14:30:19] <joal>	 elukey: ack!
[14:31:02] <joal>	 elukey: I'm assuming you'll do all historiocal nodes first, then all overlord etc?
[14:32:31] <elukey>	 joal: I was thinking one host at the time (all daemons on it), but I can follow your procedure as well
[14:32:34] <elukey>	 no preference
[14:33:39] <joal>	 elukey: I have no clue if it changes anyhing 
[14:33:50] <joal>	 elukey: I'm assuming yours would be faster
[14:34:58] <elukey>	 joal: I am going to follow your way, I like it more
[14:35:04] <elukey>	 seems more consistent with the docs 
[14:35:08] <joal>	 ok :)
[14:35:18] <elukey>	 so druid1001's historical is alive 
[14:35:31] <elukey>	 going to check its logs, when it finishes loading I'll do druid1002
[14:35:39] <joal>	 I wouldn't ever have guessed that one day my views would be "more consistent with the docs" :D
[14:36:34] <elukey>	 it wasn't an offence but a compliment :P
[14:37:52] <elukey>	 proceeding with druid1002
[14:39:11] <elukey>	 Get:1 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main druid-common all 0.10.0-3~jessie1 [189 MB] :P
[14:45:11] <elukey>	 and upgrading 1003
[14:45:38] <elukey>	 so 1002 went fine, 1001 complained a bit about segments not owned and other things
[14:45:56] <joal>	 elukey: :(
[14:46:12] <joal>	 elukey: nothing major I assume, anymore details?
[14:46:51] <fdans>	 ottomata: hadn't tested this last iteration, sorry, pushing now a tested in stat1005 modification
[14:49:31] <elukey>	 joal: the majority of them are
[14:49:32] <elukey>	 Caused by: com.fasterxml.jackson.databind.JsonMappingException: Could not resolve type id 'hdfs' into a subtype of [simple type, class io.druid.segment.loading.LoadSpec] at [Source: N/A; line: -1, column: -1]
[14:50:05] <elukey>	 but now that I check, when I restarted the daemon it didn't log "loading segment bla bla" for all the things needed as druid100[23]
[14:50:39] <elukey>	 another interesting one is
[14:50:39] <elukey>	 2018-04-30T14:40:04,660 ERROR io.druid.server.coordination.ZkCoordinator: Failed to load segment for dataSource:
[14:50:42] <elukey>	 that ends with
[14:50:51] <elukey>	 dataSource='webrequest', binaryVersion='9'}}
[14:51:01] <joal>	 hm
[14:51:21] <joal>	 bizarre! I'm looking at coordinator, and it tells me everything is fine :(
[14:51:40] <elukey>	 joal: segments loaded in druid1001?
[14:52:11] <joal>	 I don't know about specific machines, but it tells me all segments for all datasources are loaded
[14:52:19] <joal>	 And that we have 3 nodes
[14:53:06] <joal>	 When looking at datasources, it shows segments from any workers (1,2,3)
[14:53:19] <joal>	 elukey: looks good to me is what that means :)
[14:54:01] <elukey>	 yeah now logs seems a bit better
[14:54:07] <elukey>	 going to wait a bit before proceeding
[14:56:19] <joal>	 elukey: still have historical for d1003, right?
[14:56:39] <elukey>	 already done
[14:56:47] <joal>	 Wow - Fastman !
[14:56:54] <elukey>	 historicals are on 0.10 now :)
[14:57:14] <joal>	 elukey: and I can haz dataz - Looks good
[14:57:57] <elukey>	 mmmm still seeing those errors
[14:58:02] <elukey>	 I am wondering if https://gerrit.wikimedia.org/r/#/c/355469/ was applied
[14:58:33] <nuria_>	 fdans: let's talk about El tests on standup, we should probably fix issues with jenkins
[14:59:31] <elukey>	 elukey@druid1001:~$ ls -l /usr/share/druid/extensions/druid-hdfs-storage-cdh/druid-hdfs-storage.jar
[14:59:35] <elukey>	 lrwxrwxrwx 1 root root 76 Apr 30 14:34 /usr/share/druid/extensions/druid-hdfs-storage-cdh/druid-hdfs-storage.jar -> /usr/share/druid/extensions/druid-hdfs-storage/druid-hdfs-storage-0.10.0.jar
[14:59:39] <elukey>	 seems fine
[15:00:08] <joal>	 elukey: on namenode side - There still is a bump in old-gen GC after nodes come back alive, but man, they now take ~200ms :)
[15:00:15] <elukey>	 \o/
[15:00:57] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168150 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka2003.codfw.wmnet'] ```  and were **ALL** succ...
[15:01:00] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168151 (10Ottomata)
[15:01:06] <nuria_>	 ping mforns elukey 
[15:01:09] <nuria_>	 standduppp
[15:02:48] <elukey>	 joal: https://etherpad.wikimedia.org/p/analytics-druid-upgrade
[15:04:12] <elukey>	 is there anything pushing popups events to druid1001?
[15:06:00] <joal>	 elukey: those segments are for 2017...
[15:06:02] <joal>	 hm
[15:08:21] <elukey>	 joal: and only for druid1001
[15:08:27] <joal>	 weird !
[15:08:30] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168191 (10Ottomata) Done for main-codfw.  Will proceed with main-eqiad this afternoon.
[15:08:31] <joal>	 elukey: version difference?
[15:09:02] <wikibugs_>	 (03CR) 10Mforns: [C: 032] Modify output defaults for EventLoggingSanitization.scala [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429427 (https://phabricator.wikimedia.org/T190202) (owner: 10Mforns)
[15:14:49] <wikibugs_>	 (03Merged) 10jenkins-bot: Modify output defaults for EventLoggingSanitization.scala [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429427 (https://phabricator.wikimedia.org/T190202) (owner: 10Mforns)
[15:20:53] <elukey>	 joal: using a very scientific approach (a brutal restart of historical on druid1001) it loads segments now
[15:20:58] <wikibugs_>	 (03CR) 10Joal: "Bump :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/405053 (owner: 10Joal)
[15:32:15] <elukey>	 joal: I restarted the overlords (1001 is the new leader) and now I am doing middle managers
[15:32:23] <joal>	 ack !
[15:36:15] <wikibugs_>	 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats bug: deploy causes bad merges - https://phabricator.wikimedia.org/T192890#4168280 (10Milimetric) 05Open>03declined We're just going to wait until we deploy with SCAP
[15:37:33] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hadoop HDFS Namenode shutdown on 26/04/2018 - https://phabricator.wikimedia.org/T193257#4168283 (10elukey) {F17441015}
[15:40:01] <elukey>	 joal: brokers done, last ones are coordinators
[15:40:08] <joal>	 k
[15:40:13] <joal>	 Those are the importan ones
[15:42:16] <wikibugs_>	 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4168294 (10Milimetric) I ran a quick check, and for 2018-03, there are 16% more distinct UA + IP combinations than distinct IPs, so that seems like a worthwhile change.  cc @Nuria
[15:42:30] <wikibugs_>	 (03CR) 10Nuria: "Let's talk with team how to best manage this datasource config (hiera? aqs alone?) before we merge this code." [analytics/aqs] - 10https://gerrit.wikimedia.org/r/429765 (https://phabricator.wikimedia.org/T193387) (owner: 10Joal)
[15:44:15] <elukey>	 joal: how do you access to the coordinator ui? 
[15:44:33] <joal>	 tunnel to master coord on port 8081
[15:44:36] <elukey>	 ssh -L 8081:localhost:8081 druid1002.eqiad.wmnet -N
[15:44:38] <elukey>	 perfect
[15:44:38] <joal>	 elukey: --^
[15:44:55] <joal>	 elukey: up for me, looks good
[15:44:56] <elukey>	 I had the wrong one
[15:45:03] <elukey>	 druid1001 was not responding :P
[15:46:28] <nuria_>	 ping milimetric 
[15:46:37] <nuria_>	 ping joal 
[15:47:42] <elukey>	 joal: all upgraded \o/
[15:47:51] <joal>	 \o/ elukey !
[15:49:27] <joal>	 elukey: resuming hourly jobs?
[15:50:24] <elukey>	 joal: yep!
[15:53:31] <joal>	 !log Resume webrequest-druid-hourly-coord and pageview-druid-hourly-coord
[15:53:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:01:14] <joal>	 Success elukey :)
[16:01:27] <joal>	 elukey: Do you mind restarting realtime before moving to ops meeting?
[16:08:05] <elukey>	 joal: yep doing it now!
[16:08:13] <joal>	 thanks elukey :)
[16:10:29] <elukey>	 re-enabled the cron, it should create the job soon!
[16:11:59] <joal>	 elukey: Checked the realtime job - it's failing - I'm gonna launch a manual version of it
[16:13:21] <elukey>	 :(
[16:16:14] <joal>	 elukey: I can see 2 cron about BannerImpression on an1003!!
[16:16:41] <elukey>	 oooofff
[16:18:15] <elukey>	 joal: pretty sure it is a pebkac.. the process spawns every 5 mins or runs continously?
[16:18:38] <elukey>	 it theory every 5 min..
[16:18:38] <elukey>	 mmm
[16:19:34] <elukey>	 joal: wait a min, I can see two things related to BannerImpression but one is the child of the other (bash and java)
[16:19:37] <elukey>	 no?
[16:20:27] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 4 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168440 (10Tgr)
[16:21:26] <elukey>	 and from the overlord console I can't see failures..
[16:21:53] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 4 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168443 (10Pchelolo) > The problem started within hours of Kafka being enabled on mediawikiwiki, and it affects the wiki that's after mediawikiwiki a...
[16:23:20] <joal>	 elukey: I think the first one failed because of the 2 crons
[16:23:34] <joal>	 elukey: new job seems to work (I continue to monitor)
[16:24:22] <elukey>	 no idea why there were two
[16:25:26] <joal>	 elukey: supposedly temporary patch :(
[16:27:52] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168446 (10mobrovac) >>! In T193254#4168443, @Pchelolo wrote: > I believe this is the case. When switching the job to Kafka it was done only for test...
[16:27:58] <wikibugs_>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4168449 (10mobrovac)
[16:28:04] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168448 (10mobrovac)
[16:30:47] <wikibugs_>	 10Analytics, 10EventBus, 10Wikimedia-Logstash, 10Services (watching): EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230#4168453 (10fdans) p:05High>03Triage
[16:32:58] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 6 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168457 (10fdans) p:05High>03Triage
[16:33:43] <mobrovac>	 hey hey
[16:33:49] <mobrovac>	 can i get the context for these ^ ?
[16:33:50] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Unify, if possible, AQS and Restbase's cassandra dashboards - https://phabricator.wikimedia.org/T193017#4157303 (10fdans) p:05Normal>03Low
[16:34:06] <mobrovac>	 why are you de-triaging tickets?
[16:34:08] <mobrovac>	 what is going on?
[16:34:10] <mobrovac>	 fdans: ^ ?
[16:34:10] <joal>	 Hi mobrovac - We're in grooming, and were saying that we actually don't know
[16:34:17] <wikibugs_>	 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4168469 (10mforns) p:05Triage>03Normal
[16:34:21] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 6 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168466 (10Milimetric) p:05Triage>03High sorry - reverting accidental change of priority
[16:34:51] <mobrovac>	 joal: im sorry but you cannot do that for tickets like https://phabricator.wikimedia.org/T193254
[16:35:16] <joal>	 mobrovac: We didn't meant to deprioritize nor change, we have a bug in our process
[16:35:32] <nuria_>	 mobrovac: issue is that eventbus tickets always tag analytics
[16:35:47] <joal>	 mobrovac: it'll be corrected, we changed the priority of he ticket back
[16:35:57] <mobrovac>	 k thnx
[16:36:13] <joal>	 np mobrovac - Sprry for the noise
[16:36:23] <mobrovac>	 while i'm here, why is https://phabricator.wikimedia.org/T193230 being put on "radar"
[16:36:33] <mobrovac>	 i was hoping you guys would look into that?
[16:36:50] <ottomata>	 mobrovac:  i think that was because i was not present in that tasking meeting so couldn't give context
[16:36:54] <ottomata>	 i'll try to find some time to do that
[16:37:04] <ottomata>	 joal:  we should put that in ops excellence column
[16:37:06] <mobrovac>	 hehehehehe
[16:37:08] <joal>	 Thanks for showing up ottomata  :)
[16:37:09] <mobrovac>	 kk thnx ottomata!
[16:41:19] <joal>	 elukey: could we clean that crontab?
[16:43:59] <elukey>	 joal: there was indeed a duplicated entry, maybe a copy/pasta error from me, not really sure (I checked carefully and ran puppet when I re-enabled)
[16:44:02] <elukey>	 sorry :(
[16:44:16] <joal>	 elukey: I think it was Andrew, a long ago - nevermind :)
[16:45:01] <elukey>	 ran puppet, cron looks clean
[16:45:05] <joal>	 The first one was actaully incorrect :)
[16:45:09] <joal>	 Awesome - Tahnks
[16:46:32] <elukey>	 ahh nice I can see real time tasks!
[16:46:43] * elukey dances
[16:47:49] <joal>	 ;)
[16:47:51] <wikibugs_>	 10Analytics: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169#4168542 (10mforns) p:05Triage>03Normal
[16:48:11] <joal>	 elukey: I had to restart a manual job, for the same error we were having before we used the overlord-drain process
[16:48:17] <wikibugs_>	 10Analytics, 10Analytics-Wikistats, 10ORES, 10Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479#4168545 (10Milimetric) ping @awight still want to chat about this?
[16:48:20] <joal>	 elukey: everything back to normal :)
[16:48:28] <wikibugs_>	 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4168547 (10mforns) p:05Triage>03Normal
[16:49:40] <wikibugs_>	 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4168554 (10mforns) p:05Triage>03Low
[16:49:49] <wikibugs_>	 10Analytics, 10Commons, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4168557 (10Milimetric) p:05Triage>03Normal sorry - reverting accidental change of priority
[16:50:23] <wikibugs_>	 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4168562 (10mforns) p:05Triage>03Normal
[16:50:26] <wikibugs_>	 10Analytics, 10Collaboration-Team-Triage, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make EchoNotification job JSON-serialiable - https://phabricator.wikimedia.org/T192945#4168560 (10Milimetric) p:05Triage>03Normal sorry - reverting accidental change of priority
[16:52:22] <wikibugs_>	 10Analytics, 10EventBus, 10Wikimedia-Logstash, 10Services (watching): EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230#4168578 (10mforns) p:05Triage>03Normal
[16:52:55] <wikibugs_>	 10Analytics, 10EventBus, 10Wikimedia-Logstash, 10Services (watching): EventBus HTTP Proxy service does not report errors to logstash - https://phabricator.wikimedia.org/T193230#4168580 (10mforns) p:05Normal>03High
[16:55:33] <elukey>	 joal: tomorrow you are off right?
[16:55:47] <joal>	 elukey: normally yes, I'll be on/off, but not working a lot
[16:55:55] <joal>	 Maybe a bit more in the evening
[16:56:06] <joal>	 elukey: I think we can call the druid upgrade a success :)
[16:56:07] <elukey>	 ah no I meant if it is public holiday in france
[16:56:17] <joal>	 elukey: yes, public holiday inded
[16:56:43] <elukey>	 nice! so if you are ok I'd propose to upgrade druid100[456] on Wed 
[16:56:53] <joal>	 works for me elukey :)
[16:56:57] <joal>	 thanks !
[16:56:58] <elukey>	 super :)
[16:57:11] <elukey>	 after that I'll start working on 0.11
[16:57:20] <joal>	 elukey: <3
[16:57:23] <wikibugs_>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4168618 (10mobrovac)
[16:57:38] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4168614 (10mobrovac) 05Open>03Resolved a:03mobrovac We have switched the LocalRenameUserJob for all wikis to EventBus, so we don't anticipate a...
[16:57:42] <joal>	 Thanks again also elukey for the super productive change on NN today - This is great :)
[16:58:01] <elukey>	 joal: well it was a 50/50 effort, thank you too :)
[17:02:07] <wikibugs_>	 10Analytics: Update anonymous grouping to use User Agent - https://phabricator.wikimedia.org/T193415#4168647 (10Milimetric)
[17:02:11] <wikibugs_>	 10Analytics: Update anonymous grouping to use User Agent - https://phabricator.wikimedia.org/T193415#4168658 (10Milimetric) p:05Triage>03Normal
[17:02:19] <wikibugs_>	 10Analytics: Update anonymous grouping to use User Agent - https://phabricator.wikimedia.org/T193415#4168647 (10Milimetric) p:05Normal>03High
[17:21:39] <wikibugs_>	 10Analytics, 10Analytics-Wikistats, 10ORES, 10Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479#4168727 (10awight) @Milimetric Sorry to miss the earlier ping.  Some top-level metrics we can offer, (@Halfak please chime in here): * Total number of pred...
[17:24:05] <elukey>	 going afk!
[17:24:07] * elukey off!
[17:50:40] <chelsyx>	 Hi! Quick question: how should I change the names of fields of existing schema? Or maybe I shouldn't do that? https://meta.wikimedia.org/w/index.php?title=Schema:MobileWikiAppDailyStats&diff=17983055&oldid=17836915
[17:53:46] <ottomata>	 chelsyx:  unfortunetly you can't quite do it like that
[17:53:52] <ottomata>	 because some of those fields are currently 'required'
[17:53:54] <ottomata>	 so
[17:54:02] <ottomata>	 we only support adding new fields
[17:54:05] <ottomata>	 new optional fields
[17:54:14] <ottomata>	 you can remove old optional fields, and nothing will break
[17:54:33] <ottomata>	 but since those two fields are marked as 'required'
[17:54:50] <ottomata>	 you can only add the new fields .e.g 'app_install_id', but you can't remove 'appInstallId'
[17:56:32] <chelsyx>	 Got you. Thank you! I revert that edit
[18:19:59] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168953 (10Ottomata)
[18:21:43] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4168964 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka...
[18:23:39] <wikibugs_>	 (03PS1) 10Joal: Update changelog.md for v0.0.63 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429854
[18:24:15] <ottomata>	 fdans:  while my kafka reimages are happening
[18:24:29] <ottomata>	 shall we merge the geoip patch?
[18:24:39] <wikibugs_>	 (03CR) 10Joal: [C: 032] Make stats gathering optional in mediawiki-history [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/427067 (owner: 10Joal)
[18:32:42] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169027 (10alanajjar) @mobrovac I think it still the same!  see [[https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Menageross03|here]], t...
[18:33:07] <wikibugs_>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4169039 (10alanajjar)
[18:33:11] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169038 (10alanajjar) 05Resolved>03Open
[18:35:01] <wikibugs_>	 (03Merged) 10jenkins-bot: Make stats gathering optional in mediawiki-history [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/427067 (owner: 10Joal)
[18:37:08] <wikibugs_>	 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4169045 (10alanajjar)
[18:37:16] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169044 (10alanajjar) 05Open>03Resolved
[18:37:54] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4164240 (10alanajjar) Thanks a lot all
[18:38:51] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169059 (10Pchelolo) > As you know, we can't say it resolved until we being sure, because there's many pending requests, so if we said to all global...
[18:39:19] <wikibugs_>	 (03PS2) 10Joal: Update changelog.md for v0.0.63 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429854
[18:39:47] <joal>	 ottomata: if you have a minue --^
[18:39:55] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169060 (10alanajjar) Yes @Pchelolo I noticed that now, Thanks again
[18:40:10] <wikibugs_>	 (03CR) 10Ottomata: [C: 031] Update changelog.md for v0.0.63 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429854 (owner: 10Joal)
[18:40:14] <joal>	 hanks ottomata 
[18:40:19] <ottomata>	 :)
[18:40:34] <wikibugs_>	 (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429854 (owner: 10Joal)
[18:43:37] <joal>	 !log Releasing refinery-source
[18:43:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:49:09] <joal>	 fdans: have you tested the dry-run feature in prod?
[18:58:05] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4169111 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka1001.eqiad.wmnet'] ```  and were **ALL** succ...
[18:59:03] <joal>	 ottomata: Heya - I have an issue with archiva - Jenkins tells me bad gateway when trying to upload
[18:59:13] <joal>	 ottomata: Have you seen that before?
[18:59:46] <ottomata>	 hm no
[18:59:52] <ottomata>	 bad gateway?
[18:59:54] <ottomata>	 lemme tail logs
[18:59:58] <joal>	 https://integration.wikimedia.org/ci/job/analytics-refinery-release/103/org.wikimedia.analytics.refinery$refinery/console
[19:00:01] <joal>	 ottomata: --^
[19:02:22] <ottomata>	 weird joal it did get uploaded though
[19:02:29] <joal>	 hm
[19:02:31] <ottomata>	 PUT /repository/releases/org/wikimedia/analytics/refinery/core/refinery-core/0.0.63/refinery-core-0.0.63.jar HTTP/1.0" 201
[19:02:37] <joal>	 ottomata: Shall I restart the job?
[19:02:38] <ottomata>	 and i cna dl the file
[19:02:39] <ottomata>	 i guess?
[19:02:48] <joal>	 ok, trying again
[19:10:54] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4169126 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka...
[19:24:06] <joal>	 ottomata: I'm now facing a git issue, tag having been created :(
[19:24:19] <joal>	 ottomata: Should I bump the tag, or is there a way to delete it?
[19:25:18] <ottomata>	 i think we can delete
[19:25:33] <ottomata>	 you might be able to joal
[19:25:33] <ottomata>	 try
[19:25:41] <ottomata>	 git push gerrit :v0.0.63
[19:25:43] <joal>	 will try ottomata 
[19:25:44] <ottomata>	 or whatever the tag is
[19:27:14] <wikibugs_>	 10Analytics: Rename new geowiki  to geoeditors - https://phabricator.wikimedia.org/T193429#4169205 (10Nuria)
[19:27:28] <wikibugs_>	 10Analytics: Rename new geowiki  to geoeditors - https://phabricator.wikimedia.org/T193429#4169215 (10Nuria)
[19:27:32] <wikibugs_>	 10Analytics-Kanban: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996#4169214 (10Nuria)
[19:28:38] <wikibugs_>	 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4169222 (10Nuria) OK ,agreed to change datasource name to new and shinny geoeditors. Please @Ijon let us know if you have something against it. Changed superset and wikis thus far
[19:32:54] <wikibugs_>	 (03PS1) 10Joal: Reverting maven commits after failed deploy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429861
[19:32:58] <joal>	 ottomata: it worked :)
[19:33:00] <joal>	 Thanks !
[19:33:22] <wikibugs_>	 (03CR) 10Joal: [V: 032 C: 032] "Merging before deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429861 (owner: 10Joal)
[19:34:40] <joal>	 !log Retry releasing refinery-source to archiva
[19:34:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:52:39] <joal>	 ottomata: failed for a different reason ... I'm gonna clean the things, and then try again on wednesday
[19:53:30] <ottomata>	 ok
[19:53:32] <ottomata>	 sorry joal :/
[19:53:40] <joal>	 np ottomata - Thanks for the help
[19:54:32] <wikibugs_>	 (03PS1) 10Joal: Reverting again maven commits after failed deploy [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429867
[19:54:53] <wikibugs_>	 (03CR) 10Joal: [V: 032 C: 032] "Merging for clean state" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429867 (owner: 10Joal)
[19:55:42] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4169394 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['kafka1002.eqiad.wmnet'] ```  and were **ALL** succ...
[19:59:17] <joal>	 Gone for tonight after 2 failed deploy :(
[20:02:49] <wikibugs_>	 10Analytics: Weird performance of sqoop job on Edit Reconstruction - https://phabricator.wikimedia.org/T172579#4169423 (10Milimetric) 05Resolved>03Open no, this issue is unrelated to the changes we've made, and likely to still be a problem.
[20:05:10] <nuria_>	 joal: can i help you with deployment?
[20:43:22] <wikibugs_>	 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform (AKA Event(Logging) of the Future (EoF)) - https://phabricator.wikimedia.org/T185233#4169592 (10Pchelolo)
[21:21:38] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169694 (10MarcoAurelio) Do we need to migrate `CentralAuthRename` too? If so, can it be done? Thanks.
[21:24:07] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169698 (10Pchelolo) > Do we need to migrate CentralAuthRename too? If so, can it be done? Thanks.  Eventually everything will be migrated. Are you s...
[21:27:06] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169701 (10Tgr) That's a log channel, not a job queue. Other potentially affected jobs are LocalUserMergeJob (not sure if Wikimedia wikis still allow...
[21:28:13] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169704 (10Tgr) >>! In T193254#4169701, @Tgr wrote: > LocalPageMoveJob (I think that's triggered differently, not quite sure though).  Yes it is. So...
[21:31:58] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169715 (10MarcoAurelio) We are not performing any user account merges nor globally nor locally. Regards.
[21:36:15] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169720 (10Tgr) Other instances of cross-wiki job scheduling that are yielded by a quick `ack 'JobQueueGroup::singleton\( '`: Cognate/LocalJobSubmitJ...
[21:46:02] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169741 (10Pchelolo) > Other instances of cross-wiki job scheduling that are yielded by a quick ack 'JobQueueGroup::singleton\( ': Cognate/LocalJobSu...
[22:06:29] <wikibugs_>	 10Analytics, 10EventBus, 10GlobalRename, 10MediaWiki-JobQueue, and 5 others: Global renames get stuck at metawiki - https://phabricator.wikimedia.org/T193254#4169799 (10MarcoAurelio) Is this related to T192604 anyhow? Regards.
[22:29:51] <wikibugs_>	 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4169832 (10Ottomata) only kafka1003 remains...will do tomorrow.
[22:36:47] <wikibugs_>	 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4169845 (10Tbayer) >>! In T191343#4168012, @Milimetric wrote: > I'm all for renaming to Geoeditors, but at this point it does involve a bunch of busy work, so I want to make sure everyone agrees, including Asaf. >  >> Talking a...
[22:38:36] <milimetric>	 Ah, good, should do that for all Archive pages, note to self
[22:45:54] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[22:52:08] <nuria_>	 milimetric: i just remember that i did that on our pages on medoiawiki.org after Nemo_bis told me, had totally forgot
[22:52:13] <nuria_>	 *forgotten
[22:52:55] <milimetric>	 oh yeah, I vaguely remember that, so I forgot even more :)
[22:55:07] <ottomata>	 !log bouncing kafka main-eqiad -> eqiad (analytics) mirror maker
[22:55:08] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[22:56:11] <ebernhardson>	 ottomata: doesn't seem to be an alert, but around the same time it looks like https://stream.wikimedia.org/v2/stream/recentchange started 502'ing 
[22:56:30] <ebernhardson>	 seems plausibly related, although i'm unsure the architecture exactly
[22:57:41] <ottomata>	 502ing?!
[22:57:45] <ottomata>	 that is likely related
[22:57:50] <ottomata>	 it lives on this mirror maker instance
[22:57:55] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 7892 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[22:57:55] <ottomata>	 i might take job queue stuff out of this one too
[22:58:36] <ebernhardson>	 looks to have come back now, lines up well :)
[22:59:07] <ottomata>	 yeah, but it isn't going to stay
[22:59:15] <ottomata>	 ebernhardson:  how did you notice the eventstream 502?
[22:59:34] <ebernhardson>	 ottomata: i have a silly thing in labs that watches the stream and generates image signatures for a a sampling of uploads
[22:59:49] <ebernhardson>	 it got the 502's
[22:59:57] <ebernhardson>	 s/labs/cloud/
[23:00:12] <ottomata>	 ah hm ok 
[23:01:27] <wikibugs_>	 (03PS2) 10Nuria: [WIP] UA parser specification changes for OS version [analytics/ua-parser/uap-java] (wmf) - 10https://gerrit.wikimedia.org/r/429527 (https://phabricator.wikimedia.org/T189230)
[23:03:30] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 6.603e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[23:04:34] <ottomata>	 !log blacklisting change-prop and job queue topics from main-eqiad -> analytics (eqiad)
[23:04:34] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[23:05:02] <ebernhardson>	 although, i don't seem to be receiving any changes still. it only stopped 502
[23:05:11] <ottomata>	 yeah
[23:05:16] <ottomata>	 i dunno why that would 502 on this problem
[23:05:22] <ottomata>	 but messages are stuck  ^^^ should help
[23:05:31] <ottomata>	 give it a few
[23:16:53] <ottomata>	 ebernhardson:  should be ok now
[23:17:31] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[23:17:48] <ottomata>	 that is ok!
[23:17:49] <ottomata>	 i will ack that
[23:17:58] <ottomata>	 it is going to report that for a week, until the now blacklisted topics expire
[23:18:01] <ottomata>	 OH
[23:18:02] <ottomata>	 recovery!
[23:18:51] <ottomata>	 oh, the change-prop & job topcis are alreayd blacklisted from alerting
[23:18:52] <ottomata>	 ok cool
[23:19:01] <ottomata>	 so yeah, lag on e.g. recentchange stream fixed