[00:20:32] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818603 (10Pchelolo) For the reference, next time we migrate recursive jobs we need to switch off Redis queue production before switching on Kafka cons... [02:30:45] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818753 (10Pchelolo) The backlog was cleared now, all seems in good shape. [03:17:44] 10Analytics-Kanban, 10Analytics-Wikistats: Handle error due to lack of data - https://phabricator.wikimedia.org/T182224#3817033 (10Milimetric) [03:29:43] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818793 (10Pchelolo) p:05Triage>03Low [03:30:04] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818807 (10Pchelolo) [05:10:14] 10Analytics, 10Analytics-Cluster: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3818910 (10EBernhardson) [06:50:26] ebernhardson: really interested in that, please keep me in the loop! :) [08:31:49] !log stop camus on an1003 as prep step for reboot [08:31:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:45:49] Hi elukey [08:45:59] here to help if needed [08:47:57] :) [08:48:12] just released the druid_exporter 0.5 with realtime metrics [08:48:37] Man, this is just awesome :) [08:48:39] and deployed the puppet change to enable middle managers to push metrics (well peons to be precise :) [08:48:57] now I am checking up some metrics about segments since I think there might be a bug [08:49:15] elukey: I've killed the realtime job - I need to deep dive into it more cause it has failed like 5 times yesterday [08:50:28] joal: really interested to figure out what's wrong, lemme know if I can help [08:50:39] ahhh! [08:50:39] druid_historical_segment_used_bytes{datasource="_default_tier"} 31841088057.0 [08:50:43] [..] [08:50:45] druid_historical_segment_used_bytes{datasource="_default_tier"} 96799895856.0 [08:50:48] druid_historical_segment_used_bytes{datasource="_default_tier"} 22594884516.0 [08:50:51] this is definitely not right [08:50:53] uffffff [08:51:25] after the restart of the exporter historical bytes used jumped to 29G [08:51:36] and i was like "what?" [08:54:03] they don't have the datasource [08:54:18] or better, they have it wrong [08:54:48] ah yes because I am stupid [09:10:56] so this time I need to find a proper solution for something that I tried to postpone [09:11:12] namely the fact that metric names are not unique (like segment/used) [09:11:18] hm [09:11:48] sorry joal I was dumping my thoughts without a lot of context [09:11:59] no worries elukey [09:12:04] so segment/used can be emitted by historical and coordinator [09:12:21] but with different dimensions (tier + datasource vs datasource only) [09:12:41] I thought I found a solution to use one data structure for all the metrics [09:12:56] but then I had to do little hacks to preserve this [09:13:03] now I am going to do things properly :) [09:14:10] hi aaaallll! 
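A minimal sketch of how the duplicated-series symptom above can be checked against a running druid_exporter; the host and port (localhost:8000) are assumptions, only the metric name comes from the paste:

    curl -s http://localhost:8000/metrics \
      | grep '^druid_historical_segment_used_bytes' \
      | awk '{print $1}' | sort | uniq -c | sort -rn | head
    # Prometheus text format is "name{labels} value"; a count above 1 here means the
    # same name/label combination is exported more than once, matching the paste above
    # where the tier name ("_default_tier") is ending up in the datasource label.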
[09:14:31] Hi mforns [09:16:27] o/ [09:30:40] joal: from yarn's UI it seems that we only have 3 hive queries left from various users [09:30:51] I'd wait a bit more and then proceed with the reboot [09:30:52] elukey: watching [09:31:15] elukey: given you sent an email yesterday, I say go for it [09:31:28] I know bearloga's one will rerun automatically - for thoers I don't know [09:32:13] maybe I can wait amire80's one to complete (shouldn't take long) [09:32:22] correct [09:32:46] and actually elukey, tilman's one is also well advanced [09:33:21] ok let's wait another say hour [09:33:24] and then decide [09:33:42] elukey: hopefull it'll be faster than thtat [09:34:41] * elukey is always the pessimist [09:34:59] joal: I just want to highlight the fact that this time I've only stopped camus :P [09:35:03] * elukey hides [09:35:55] huhuhu elukey :) [09:38:39] joal: <# [09:38:40] <3 [09:43:02] he he elukey! [09:43:28] just wondering, what time did you steal notebook1002 yesterday? [09:44:17] addshore: still haven't done anything, just copied homedirs over to notebook1001 [09:44:23] addshore: do you use it? [09:44:38] nope [09:45:04] but i noticed a bunch of PAWS edits stop on wikidata @ 23:00 yesterday so just wondered if it had antyhing to do with it :) [09:45:15] maybe a user got missed, but as you havn't touched anything yet i guess now! [09:46:12] nono nothing has been done so far afaik [09:46:44] https://tools.wmflabs.org/sal/production?p=0&q=notebook&d= seems to confirm that [10:11:02] joal: amire80 seems to have fired another query, maybe it is an automatic job? [10:11:29] elukey: I think it is [10:11:37] elukey: tilman's one is finished - Let's go [10:11:40] super [10:11:43] this was the huge one [10:13:09] ah snap one thing might be problematic [10:13:25] The druid clusters are using mysql on analytics1003 [10:13:37] mwarf [10:13:47] the prod one as well I guess [10:13:52] both yes [10:13:55] Marf bis [10:14:04] I think we need to stop them [10:14:49] I am sure that the overlord uses the db [10:15:04] realtime nodes too (but we are good) [10:15:21] so in theory, a brief downtime should not be cause a major issue to Druid [10:15:27] yes - realtime is down for now [10:15:29] the overlord should complain a bit in the logs [10:16:06] but then it should recover once the db is up [10:16:22] we are not indexing anything right now [10:16:52] ok - let's move then [10:18:31] rebooting in a min [10:23:20] so it seems stuck in something like https://www.reddit.com/r/debian/comments/2jyquk/systemd_issue_at_boot_a_start_job_is_running_for/ [10:25:10] moritzm: (if you have time) [10:25:51] after rebooting analytics1003 (a bit important for us) the host is stuck in "Create volatile files and .." [10:25:59] well systemd is stuck [10:26:27] let me have a look [10:27:48] elukey: it's back up [10:28:21] moritzm: so only mentioning your name scared it away? :D [10:28:31] I didn't do anything (when I had logged in via mgmt, I saw the last startup line for oozie and then the console prompt came up [10:28:45] probably yes, it's like saying Candyman a few times in front of a mirror [10:28:53] ahhahha [10:28:55] I [10:29:12] I'll try to see if I can follow up on this issue, I don't like it a lot :) [10:29:13] but I'll have a look whether I can spot something, you can analyse startup times post-boot with systend [10:29:39] thanks a lot! [10:31:21] wow, 13 seconds kernel startup and 6:22 mins (!) 
userspace [10:31:49] yeah :( [10:31:58] 6:10 of that spent in systemd-tmpfiles-setup.service [10:32:53] maybe a bloated /tmp? [10:33:17] mmm doesn't seem so [10:33:59] ah "systemd-tmpfiles creates, deletes, and cleans up volatile and temporary files and directories, based on the configuration file format and location specified in tmpfiles.d(5)." [10:34:34] yeah, but there are none :-) [10:34:43] I'll try running the command manually [10:35:16] ends in a few ms [10:36:24] maybe /tmp was huge due to hadoop garbage and systemd-tmpfiles spent a huge time cleaning? [10:36:47] joal: everything seems fine to me, shall I re-enable camus? [10:39:26] those cleanups should have happened during the system shutdown as part of the reboot alredy [10:39:54] I have no idea, all the tmpfiles configs seem common to all other servers [10:40:19] it's /etc/tmpfiles.d, /run/tmpfiles.d and /usr/lib/tmpfiles.d [10:40:34] maybe one of those was blocked on an hfds mount or so? [10:40:51] no idea [10:41:56] I'll investigate! [10:42:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:42:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:42:56] ssh analytics1003.eqiad.wmnet [10:42:56] System is booting up. See pam_nologin(8) [10:42:57] Authentication failed. [10:43:00] whatttt [10:46:10] joal: this is really bad [10:46:32] O.o [10:47:15] aouch elukey [10:47:23] I can see System is booting up. See pam_nologin(8) [10:51:08] host is booting, I had to powercycle it [10:53:12] elukey: hm - I'll need more explanations on power hardware [10:54:19] so an1003 is back up [10:54:56] elukey: succesfully back up, or in error-mode back up? [10:55:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:55:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:55:56] hm - doesn't smeel good :( [10:56:48] mysql is up, hive doesn't like a lot what happened [10:57:23] I have no idea why that host rebooted [10:57:38] elukey: let me know how you wish to help [10:57:41] 10Analytics-Kanban, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3819562 (10Marostegui) [10:57:43] log investigation? [10:58:01] joal: now hive seems up [10:58:48] elukey: connection from stat1004 doesn't work (hive) [10:59:09] :Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused) [11:00:05] in theory I see tcp 0 0 0.0.0.0:9978 0.0.0.0:* LISTEN 1412/java [11:00:14] and 1412 is hive-server [11:00:16] several of the system services stopped/went down at 10:50:07 [11:00:44] similar to what would have happened during a controlled reboot (but there was none AFAICT) [11:00:46] moritzm: that one was me powercycling [11:01:11] !log powercycle analytics1003 - no serial console, ssh stuck in System is booting up. See pam_nologin(8) [11:01:26] ah, indeed, missed SAL [11:01:34] elukey: it seems stat1004 still can't connect [11:01:36] and before that the host was inaccessible via standard SSH? 
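Going back to the slow first boot: a sketch of the post-boot timing checks moritzm alludes to, all standard systemd tooling, run on analytics1003 once it is reachable again:

    systemd-analyze time                    # kernel vs userspace split (the 13s vs 6:22 above)
    systemd-analyze blame | head -20        # per-unit cost; where systemd-tmpfiles-setup.service showed its 6:10
    sudo systemd-tmpfiles --create --clean  # re-run the step by hand (it finished in milliseconds above)
    ls /etc/tmpfiles.d /run/tmpfiles.d /usr/lib/tmpfiles.d   # the config locations being compared across hosts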
[11:01:37] joal: ah Connecting to jdbc:hive2://analytics1003.eqiad.wmnet:10000 [11:01:53] moritzm: yeah and Hive was down (server + metastore) [11:03:40] strange [11:03:57] so port 10000 is not up, trying to figure out what daemon is responsible for it [11:04:00] I thought hive-server [11:04:15] in kern.log.1 there's plenty of log messages where CPUs maxed out the temperature threshold, throttling the CPU [11:04:22] maybe some hw/thermal problem? [11:06:33] according to our ferm config 10000 is hive server, yes [11:09:34] Dec 07 11:08:46 analytics1003 hive-server2[7976]: Failed to start Hive Server2. Return value: 1 ... failed [11:09:44] tried to restart it and I got this, nothing relevant on the logs [11:09:53] might be the new prometheus settings [11:10:01] but everything was working fine befor [11:11:40] oh, hive-server really needs a native systemd unit... [11:12:46] FATAL ERROR in native method: processing of -javaagent failed [11:13:40] fails to bind, Address already in use [11:13:57] maybe the prometheus exporter blocks some port which is expected by hive? [11:14:21] in theory no [11:15:14] or the jmx stuff? it's at least part of the traceback (io.prometheus.jmx* below) [11:15:19] ah yes in netstat I can see the ports already used by old processes [11:15:21] wtf [11:16:32] so systemctl restart might not be working correctly [11:16:45] GOTO: "oh, hive-server really needs a native systemd unit..." [11:16:52] that's why I mentioned a proper systemd service unit :-) [11:16:54] exactly :-) [11:17:10] it's reall hard to properly track dependencies with the service units that get derived from an init script [11:17:14] so now hive-server is up [11:17:19] but port 10000 is not [11:18:22] it seems to be using 9978, 46077, 39147 [11:18:39] 46077 and 39147 are probably ephemeral, but is 9978 expected? [11:19:17] can't find it in our puppet/ferm rules at least [11:19:50] so 51010 is the exporter [11:20:13] 9978 is jmx [11:23:11] so at this point I think that hive is not configured anymore for port 10000 [11:23:22] that might be due to the last refactoring, something that I've missed [11:23:55] I think I have a hunch [11:24:12] I ran 'dpkg -L hive-server2' [11:24:29] and it lists a /etc/default/hive-server2 which should be present per package status db [11:24:32] but it's not around [11:24:42] and it's the file which is sourced from the init script [11:24:46] which refers to $PORT [11:24:53] but $PORT isn [11:25:33] and $PORT should probably have been read from /etc/default/hive-server2 [11:25:52] or am I missing something and we intentionally kill this file in puppet? [11:26:20] I don't have all those analytics git submodules checked out I think [11:27:14] mmmm not that I know [11:27:32] so we set profile::hive::client::server_port: 10000 [11:27:40] that is in turn only used in the beeline script [11:28:13] and it was not part of my last refactoring [11:28:20] so this might be a bomb that was waiting me to restart [11:28:27] but that's for hive_client, isn't it? 
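A sketch of the port triage happening here, using the numbers quoted in the conversation (10000 = the Thrift port ferm expects for hive-server, 9978 = JMX, 51010 = the exporter); not the exact commands that were run:

    sudo netstat -tlnp | grep -E ':(10000|9978|51010)\b'          # what is bound, and by which pid
    ps -ef | grep -E 'HiveServer2|HiveMetaStore' | grep -v grep   # stale JVMs that survived "systemctl restart"
    # once 10000 is really bound, an end-to-end check from a client host such as stat1004:
    beeline -u jdbc:hive2://analytics1003.eqiad.wmnet:10000 -e 'SHOW DATABASES;'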
[11:29:32] yeah, I am checking also class { '::cdh::hive::server': } [11:29:47] we rely on having the same config everywhere, that is brought by the client class [11:29:53] (I know naming is a bit weird) [11:31:06] mhh, /etc/default/hive-server2 seems to be a red herring [11:31:24] I downloaded the pristine deb from our mirrors and it only contains [11:31:27] #PORT= [11:31:35] (and some copyright blurb) [11:32:14] we could ofc try to simply add /etc/default/hive-server2 with PORT=10000 and restart hive-server [11:32:18] worth a shot IMO [11:33:16] yeah, going to investigate a bit more and then see [11:34:49] so we use hive-env.sh to set up variables [11:34:51] mmm [11:36:00] I'm not sure if that's effective, doesn't system reset the environment when running a unit? (unless explicitly configured via the systemd directives) [11:37:29] but not sure (especially since the unit is generated from the init.d script) [11:38:23] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/389496 (https://phabricator.wikimedia.org/T178504) (owner: 10Joal) [11:38:47] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/395732 (https://phabricator.wikimedia.org/T178478) (owner: 10Joal) [11:41:03] (03PS2) 10Joal: [FUN] Add performance tests for scala JSON libs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/395018 [11:46:11] (03PS1) 10Joal: Update aqs to bcbdbd3 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/395981 [11:47:25] (03CR) 10Joal: [V: 032 C: 032] "Self merging for later deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/395981 (owner: 10Joal) [11:50:03] elukey: should we send an email letting analytics-people know we are experiencing a cluster issue? [11:50:37] elukey: actually, just double checked my emails, and seen you already did - sorry for the disturbance :( [11:52:10] joal: :) [11:52:49] so there must be something that I am missing [11:52:57] I don't find in puppet where we set the 10000 port [11:53:58] ah wait hive.server2.thrift.port – TCP port number to listen on, default 10000. [11:54:46] so it should do it by itself [11:57:44] RECOVERY - Hive Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [11:57:44] also removed the prometheus javaagent config, nothing [11:58:33] hive still doesn't work [12:05:15] what the hell it was the javaagent [12:05:16] RECOVERY - Hive Metastore on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [12:05:55] yeah all good now [12:07:09] oh boy what a reboot [12:07:32] elukey: <3 [12:08:05] elukey: How may I help now? [12:08:28] so for some reason, -javaagent blocks other things. It is like it takes over and does weird things [12:08:54] what fixed it? [12:09:09] elukey: I acually don't know what javaagent does for us [12:09:09] removing the -javagent things for prometheus [12:09:34] it was going out with this reboot [12:09:44] but hive picked it up correctly [12:10:00] so I think that it had the side effect to prevent other things to be set [12:10:47] ok [12:11:21] thanks for the support moritzm [12:15:18] ok joal removed from puppet the java agent config [12:15:28] oozie and hive didn't like prometheus [12:15:37] elukey: can you tell me more about java agent? 
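The confusing part of this failure is that the JMX javaagent kept serving metrics while HiveServer2 never bound its Thrift port; a quick way to make that mismatch visible, with 51010 being the exporter port identified earlier and 10000 the hive.server2.thrift.port default mentioned above:

    curl -s http://analytics1003.eqiad.wmnet:51010/metrics | head -5   # the javaagent exporter answers...
    nc -vz analytics1003.eqiad.wmnet 10000                             # ...while the Thrift port stays closed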
[12:15:38] I will try in labs to reproduce the problem [12:15:59] sure [12:16:11] the prometheus jmx exporter runs as -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=%{::ipaddress}:9183:/etc/hive/prometheus_hive_metastore_jmx_exporter.yaml [12:16:20] (changing the parameters of course) [12:16:39] now in every other daemon, it worked perfectly [12:16:53] druid, analytics100[12] + all worker nodes [12:16:56] cassandra [12:17:04] but with oozie hive is not behaving well [12:17:29] the main issue with oozie was explicit in the error logs, so I was able to figure it out straight away [12:17:54] for hive it was not that easy since the agent was working fine, exposing metrics [12:18:05] and hive* (the processes) were running [12:18:10] but not setting their port [12:19:17] yw [12:19:20] joal: if you are ok I'd re-enable everything [12:19:36] there is already a job running [12:19:58] elukey: let's reenable, I'll keep an eye on jobs [12:21:11] elukey: I know you've been under pressure this morning - Would you still accepting me deploying AQS and possibly cluster later today? [12:22:07] oh yes sure [12:22:15] I am really sorry for the extra trouble [12:22:41] elukey: Don't be sorry - you care our systems - Many thanks for that [12:24:02] !log camus re-enabled after analytics1003 reboot [12:24:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:24:23] so a good mixture of weird things [12:24:26] and nice follow ups [12:24:49] elukey: Actually, I am sorry, for not being able to help in those situations [12:24:54] I am particularly worried about the fact that the hive processes were not dying when doing stop [12:25:12] elukey: yes, silent failures are the worst [12:25:31] nono self caused, I didn't know that port 10000 wasn't set explicitly so I lost time on a false lead [12:38:39] elukey: I suggest I go for he deploy after your lunch :) [12:53:12] 10Analytics-Kanban, 10User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3819866 (10elukey) The last reboot (analytics1003) was particularly painful due to several issues happening in a row. Timeline of events in UTC: * [10:12] Reboot of analy... [12:53:25] joal: timeline in --^ [12:57:19] elukey: good doc, :) [12:57:27] elukey: at least the boot time after your 10:50 reboot was quick; 14s, 55s user space, so maybe this was in fact a case of needing to lots of cleanups for the previous boot [12:58:33] 10Analytics-Kanban, 10User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3819887 (10elukey) Worth to mention, found this while investigating: ``` elukey@analytics1003:/var/log/mylvmbackup$ less analytics-meta.log [...] Can't locate File/Copy/Re... [12:59:06] moritzm: adding it to the timeline :D [13:01:29] all right, going out for lunch! [13:01:32] brb in a bit! [13:01:35] * elukey lunch! [14:10:16] whoa elukey those are old analytics meta backups! [14:10:17] yikes [14:11:35] i'm looking into that now... [14:15:42] ottomata: hiiiii [14:16:26] guess we need some alerts on backup age... [14:16:32] +1 [14:22:57] elukey: is it possible some mariadb/mysql stuff/defaults changed when there was that puppet refactor? 
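A rough sketch of the backup-age alert suggested just above; the path, file pattern and two-day threshold are all assumptions rather than the real layout on analytics1002, and the exit codes follow the usual Icinga/Nagios convention:

    BACKUP_DIR=/srv/backups/mysql/analytics-meta    # hypothetical location
    if find "$BACKUP_DIR" -name '*.gz' -mtime -2 | grep -q .; then
        echo "OK: recent analytics-meta backup found"
        exit 0
    fi
    echo "CRITICAL: no analytics-meta backup newer than 2 days under $BACKUP_DIR"
    exit 2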
[14:23:09] dunno why this perl dep is missing all of the sudden, but there are other problems too [14:24:31] it seems that the last backup was on Sept 22nd though [14:24:39] I did the refactor some days ago [14:24:50] and IIRC nothing really changed when running puppet [14:24:56] but I might have missed something [14:26:28] hm, e.g. the name of the lvm volume is wrong [14:26:40] and mylvmbackup can't find the my.cnf file [14:28:05] that one might be my fault, but I thought it was the was before/after the refactor [14:28:22] what should be there ? And what value we have now? [14:29:24] elukey: https://gerrit.wikimedia.org/r/396010 [14:33:37] good to me [14:36:02] 10Analytics, 10Analytics-Cluster: Alert on age of backups on analytics1002 - https://phabricator.wikimedia.org/T182327#3820110 (10Ottomata) [14:37:21] elukey: did you see: https://gerrit.wikimedia.org/r/#/c/392978/ ? [15:03:11] !log restart webrequest-misc load job (Dec 7 2017 06:00:00) [15:03:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:07:13] elukey: comments responded to, jenkins happy :) [15:11:57] lgtm [15:12:12] :) [15:32:25] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1532172 (10Marostegui) Hi! What's the status of this? [15:34:55] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1532183 (10elukey) >>! In T108850#3820300, @Marostegui wrote: > Hi! > > What's the status of this? Going to be done before EOQ :) [15:35:23] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3820313 (10Marostegui) Awesome! Thanks! :-) [15:45:20] 10Analytics-Kanban, 10Analytics-Wikistats: Add link to new wikistats 2.0 to wikistats 1.0 pages - https://phabricator.wikimedia.org/T182001#3820356 (10Milimetric) Erik, I just created the talk page and added a welcome message: https://wikitech.wikimedia.org/wiki/Talk:Analytics/Systems/Wikistats Your version o... [15:50:01] 10Analytics-Kanban: Alert on age of backups on analytics1002 - https://phabricator.wikimedia.org/T182327#3820377 (10elukey) p:05Triage>03High [15:57:03] 10Analytics-Kanban, 10Analytics-Wikistats: Add link to new wikistats 2.0 to wikistats 1.0 pages - https://phabricator.wikimedia.org/T182001#3820397 (10Nuria) Nice, thanks for getting this done. [16:00:04] ping joal ottomata [16:01:02] a! [16:01:37] 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3820420 (10elukey) [16:04:51] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820442 (10Milimetric) Verified - everything has CORS now, thanks to @MusikAnimal for the report. [16:07:56] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820470 (10Pchelolo) 05Open>03Resolved Indeed thanks to @MusikAnimal. Resolving. 
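A spot-check along the lines of what T179113 above verifies: a pageview API request that is bound to 404 (the article title here is made up) should still carry the CORS header:

    curl -sI 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/No_Such_Article_T179113/daily/2017120100/2017120700' \
      | grep -iE '^(HTTP|access-control-allow-origin)'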
[16:12:00] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820527 (10mobrovac) [17:22:27] (03PS1) 10Thiemo Mättig (WMDE): Record metrics for Wikidata task priorities (via color) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 [17:26:42] ottomata: just checked the lvm backup, much better now! [17:26:46] \o/ [17:27:48] joal / elukey: aqs deployment is up, checked on a couple hosts [17:27:50] looks good [17:28:31] milimetric: can you let me know when done? I'll heck my metrics :) [17:28:37] joal: done [17:28:44] hehe :) [17:30:02] so joal the new version of the druid_exporter is ready [17:30:07] I'll install it on Monday [17:30:20] ah forgot to tell you guys, tomorrow is bank holiday in italy :) [17:30:23] Super elukey :) [17:34:07] milimetric: I confirm my metrics match ! [17:34:07] This is a MATCH ! [17:34:47] :) yay [17:34:51] we have a match [17:34:52] gogogogo [17:36:19] Man - I'm trying it onsite - looks GOOOOOOd :) [17:39:59] * elukey off! [17:40:01] byyyeee [17:41:52] Bye elukey :) [17:42:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "(I don’t have +2 rights in this repository)" (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [17:42:35] milimetric: Only thing I think misses sa of now in UI is ability to filer by both page and editor type (like, in order to have user on content activity for instance) [17:43:27] joal: yeah, that was on purpose in this level of the UI. And we were thinking of making a huge table that displays and filters/slices all the data as an advanced interface [17:43:39] makes sense milimetric [17:43:55] :) [17:44:03] elukey: how's your ldap foo [17:44:03] just double checked edits metric - it matches wikistats closely :) [17:44:04] ? [17:44:09] oh you are off! [17:44:39] I'm super happy milimetric - Thanks a lot for having dpeloyed :) [17:45:04] psh, joal I did nothing, thanks for the two years of blood sweat and tears [17:46:02] * joal whistle 'spinnin [17:46:02] https://www.youtube.com/watch?v=kK62tfoCmuQ [17:49:13] (03CR) 10Joal: [C: 032] Correct clickstream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/391193 (owner: 10Joal) [17:50:07] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/390226 (https://phabricator.wikimedia.org/T175844) (owner: 10Joal) [17:50:55] milimetric: If you're ok we can move on deploying refinery-source [17:51:05] All the things I wanted are in there [17:51:53] sweet, joal I'll start deploying -source [17:52:26] milimetric: in the meantime I'm checking jar versions where needed, and will provide patches [17:53:12] joal: for refinery, right? So I can deploy that after -source [17:53:38] milimetric: correct - By the way, please wait a few minutes- jenkins is about to merge a patch [17:53:59] joal: k, ping me [17:54:55] (03Merged) 10jenkins-bot: Correct clickstream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/391193 (owner: 10Joal) [17:55:42] milimetric: --^ done :) [17:55:50] k, resuming [17:56:04] milimetric: you follow the doc I assume, right? 
[17:56:10] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source [17:56:52] super :) [17:57:13] milimetric: a note not to forget changelog.md ;) [17:57:45] I follow directions exactly, because of my dyslexia I end up reading them 30 times :) [17:57:58] huhu :) [18:13:07] (03PS1) 10Joal: Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 [18:13:27] milimetric: --^ this one is the only I need from what I have seen [18:13:59] Gone for diner, back after [18:18:31] (03PS1) 10Milimetric: Update changelog for v0.0.55 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/396074 [18:18:58] (03CR) 10Milimetric: [V: 032 C: 032] Update changelog for v0.0.55 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/396074 (owner: 10Milimetric) [18:26:13] (03CR) 10Milimetric: [C: 032] Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 (owner: 10Joal) [18:29:23] (03CR) 10Milimetric: [V: 032 C: 032] Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 (owner: 10Joal) [18:32:45] 10Analytics, 10Analytics-Cluster, 10Operations: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342#3820871 (10Dzahn) [18:39:21] !log Deployed refinery-source using jenkins [18:39:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:41:20] Joseph, when you're back, all's deployed ^ you can restart your job whenever. [18:47:39] (03CR) 10Addshore: "Yup, right now I think I might be one of the only ones, purely because I am one of the only ones that can test it :/" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [18:47:48] (03CR) 10Addshore: "/ test it on the machine that it runs from" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [19:01:29] gotta run to the doc, but I’m getting irc on my phone now, so here if you need me [19:04:19] How bad is https://phabricator.wikimedia.org/T182342? [19:30:13] Awesome milimetric [19:30:33] !log Deploying refinery now that -source is deployed [19:30:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:44:57] heya ottomata - are you here ? [19:45:33] ya [19:45:33] heya [19:45:43] ottomata: I need some help wih scap :( [19:46:11] ottomata: it failed deploying refinery from tin onto stat1005 with a git-fat error [19:46:24] ottomata: do you have a minute for that? [19:47:03] joal sure [19:47:05] i will find the hammer [19:47:12] :D [19:47:20] * joal loves ottomata's ways [19:48:56] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3821071 (10Pchelolo) After a day of running the jobs for wiktionaries I don't see any issues at all, but on the contrary I don't really see any dedupli... [19:58:45] joal ok, mostly fix, but there seems to be an artifact that is not in archiva [19:59:23] hm - really ? That's super weird ! [19:59:24] hmm wait a minute [19:59:29] ottomata: more info on which? [19:59:58] trying to find out but.... hmmm git-fat still didn't work right [20:01:57] joal they are all your new .55 files [20:02:48] ottomata: ??? 
[20:02:52] artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar:#$# git-fat da39a3ee5e6b4b0d3255bfef95601890afd80709 0 [20:02:53] but also [20:03:00] all your 0.55 files have the same sha [20:03:06] core, cassandra, job, etc. [20:03:08] all the same sha [20:03:23] hm - milimetric deployed that, probably through regular jenkins I imagine [20:05:17] ok joal the files are in archiva [20:05:17] I did, and I had to do that extra build in jenkins because the first try still had the v54 in the Release Version box [20:05:17] its just that refinery/artifacts are wrong [20:05:17] so, you can probably remote your .55 jars [20:05:17] and re-add them manually (with git fat) and comit [20:05:22] weird, I followed the instructions exactly [20:05:44] ottomata: How do you think we should move now? Shall I ry to redeploy v0.0.56? [20:05:56] archive and the 0.55 jars are fine [20:06:04] its just refinery/artifacts that have the wrong git fat shas [20:06:09] if you DL the 0.55 jars [20:06:18] and git add them (with a properly git fat inited repo) [20:06:24] ottomata: Maybe we can try the jenkins way anew? [20:06:27] you should be able to make a manual commit that properly adds [20:06:34] joal: sure, but i dont' think yo need to make a new release [20:06:36] but up to you [20:06:41] I'll try that first [20:06:49] can you just do the refinery add files step? [20:06:53] and if not success I'll go for manual [20:07:19] ok for you ottomata? [20:07:48] k [20:07:49] ya [20:07:49] sure [20:08:40] my local repo didn’t have git fat because it’s my new computer, but I didn’t think that would matter since I did everything on jenkins [20:08:44] git log [20:08:47] oops [20:09:39] milimetric: from commit message, I can tell that there have been an error in deploy conf for linking jars into refinery [20:09:39] I had not noticed it, but now I see --> Add refinery-source jars for vv0.0.55 to artifacts [20:10:18] double v -- I hink you've written v0.0.55 in the release-version field of jenkins, while it was waiting for just the numbers [20:10:38] Oh! I must have. Ok, I’ll edit and add a note about that [20:10:51] This sequence of action for deploy is very misleading - in the previous jenkins box, you need to put v0.0.X, and in that one, just 0.0.X [20:11:02] We should actually patch he process [20:11:24] milimetric: I recall having been there before, and told myself how conter-intuitive this hing is [20:12:27] !log Trying to deploy refinery again [20:12:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:12:57] joal: I think I noticed it before and avoided the mistake, but I talked to madhu and she said it wasn’t possible to change somehow [20:13:14] for now, I’ll update the docs [20:13:20] Many thanks :) [20:14:01] ottomata: we still have the same issue - I'll remove the wrong jars (adding new ones is not enough, obviously -- facepalm) [20:15:38] ayeee ok cool [20:16:00] (03PS1) 10Joal: Remove wrong artifacts from directory [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396102 [20:16:29] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396102 (owner: 10Joal) [20:20:29] Success ! Thanks a lot ottomata and milimetric :) [20:20:39] phew, great! [20:22:39] ottomata: o/ - forgot to ask if we need to keep kafka10* with puppet disabled [20:23:07] oh! no [20:23:15] that was an oversight on my part [20:23:16] enabling [20:26:07] ottomata: any info on the deployment schedule discussed in ops? 
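For reference, the stub quoted at the top of this exchange ("#$# git-fat da39a3ee... 0") is git-fat's placeholder for an empty payload: da39a3ee5e6b4b0d3255bfef95601890afd80709 is the SHA-1 of an empty stream and the trailing 0 is the size. The manual fix ottomata sketches (the route actually taken was to drop the bad stubs and re-add the jars through the usual add-files step) would look roughly like this, assuming the 0.0.55 jar has already been downloaded from archiva into the home directory:

    cd analytics/refinery
    git fat init                                   # make sure the clean/smudge filters are active
    cp ~/refinery-hive-0.0.55.jar artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar
    git add artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar
    git show :artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar   # staged stub should now carry a real sha and size
    git commit -m 'Re-add refinery-source v0.0.55 artifacts with git-fat'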
[20:26:56] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821198 (10Dzahn) Is there a ticket to get eventlog2001 back into production? It is in site.pp but doesn't have any roles. Adding it with ro... [20:28:02] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#1840331 (10Ottomata) > Is there a ticket to get eventlog2001 back into production? It never was in production. [20:28:22] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821203 (10Dzahn) So it should be decom'ed? [20:31:43] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3821214 (10EBernhardson) I suppose for a little more background on what i think is happening: * The executors that die seem to be the ones that... [20:44:35] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821234 (10mpopov) p:05Triage>03Normal [20:45:21] !log Kill restbase oozie job and restart apis replacing one [20:45:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:48:31] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821256 (10mpopov) [20:52:32] joal: still here? [20:52:38] yup [20:52:43] whassup ottomata ? [20:52:46] can you test something for me? [20:53:12] sure [20:53:18] edit your /etc/hosts and set an alias for 127.0.0.1 superset.wikimedia.org [20:53:19] then [20:53:22] ssh -N thorium.eqiad.wmnet -L 9081:thorium.eqiad.wmnet:80 [20:53:31] then [20:53:42] http://superset.wikimedia.org:9081 [20:53:48] i want to see if you can authenticate with your ldap username and pw [20:53:56] and then also if i can promote you to an admin user [20:55:26] ottomata: ERR_SSL_PROTOCOL_ERROR [20:55:44] joal i got that too in chrome, dunno why its redirecting to ssl [20:55:51] you in chrome? [20:56:25] safari worked for me (can't remember if you are on osx or not) [20:56:51] yes - just tried in ff - failed with same [20:56:59] nope [20:57:06] debian [20:57:43] huh weird [20:57:48] dunno where that redirect comes from [20:58:08] ok, i'll have to get the full lvs stuff up first then [20:58:58] sorry ottomata - trying to look into network inside chrome [21:00:21] np [21:00:23] no worries [21:00:33] it can wait til tomorrow sometime [21:00:52] I actually can't see what is redirecting me [21:00:57] we'll see tomorrow [21:02:26] milimetric: refinery is deployed - I'll move jobs to prod (changing table, new one, new jobs etc) tomorrow my morning [21:02:30] milimetric: ok for you ? [21:02:37] Like that I'll have time to fix :) [21:03:29] great, ok with me joal [21:03:56] that way people have time to respond to the email [21:04:35] true :) [21:09:52] !log Start clickstream oozie job [21:09:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:13:59] Gone for tonight a-team [21:27:55] joal: laters! if you want to try, its up! 
[21:28:01] https://superset.wikimedia.org [21:56:26] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821459 (10TJones) So, I think this is a very nifty idea, but there are some potential pitfalls to be aware of. The claim (copied in the Java port from the original) t... [22:51:45] 10Analytics, 10Discovery, 10EventBus, 10Wikidata, and 4 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3821621 (10Pchelolo) Hm, actually if I just try to consume from that topic (any topic actually) with `-F "%T"` that should give me message timestamps i... [23:19:37] 10Analytics-Kanban, 10Patch-For-Review, 10Services (watching): Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3821743 (10Pchelolo) I guess that has been done since I was able to add the Action API graph to the API summary dashboard: https://grafana.wikimedia.org/da... [23:23:34] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3821751 (10Pchelolo) [23:23:35] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Add the ability to sign and verify jobs - https://phabricator.wikimedia.org/T174600#3821748 (10Pchelolo) 05Open>03Resolved The signing/verification has been implemented. Resolving. [23:26:58] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make Kafka JobQueue use Special:RunSingleJob - https://phabricator.wikimedia.org/T182372#3821757 (10Pchelolo) p:05Triage>03Normal [23:28:50] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3821776 (10Pchelolo) [23:28:53] 10Analytics, 10ChangeProp, 10EventBus, 10Services (done): Support topic arrays in ChangeProp config - https://phabricator.wikimedia.org/T175727#3821773 (10Pchelolo) 05Open>03declined Since we've moved the job specifications into vars.yaml this is no longer required. [23:41:03] 10Analytics, 10Discovery, 10EventBus, 10Wikidata, and 4 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3821787 (10Nuria) I got same doing: /home/otto/kafkacat -Q -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create:0:1512687299 -Xdebug=al...
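For the timestamp/offset discussion at the end: a hedged sketch of the two kafkacat modes involved, with broker and topic taken from the command above. Note that -Q resolves a timestamp to an offset and expects epoch milliseconds, while the value above (1512687299) looks like epoch seconds:

    # print a few messages with their broker-assigned timestamps (ms), offsets and sizes
    kafkacat -C -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create -p 0 \
             -o beginning -c 3 -f 'ts=%T offset=%o size=%S\n'
    # query mode: offset of the first message at or after the given timestamp
    kafkacat -Q -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create:0:1512687299000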