[00:20:32] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818603 (10Pchelolo) For the reference, next time we migrate recursive jobs we need to switch off Redis queue production before switching on Kafka cons... [02:30:45] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3818753 (10Pchelolo) The backlog was cleared now, all seems in good shape. [03:17:44] 10Analytics-Kanban, 10Analytics-Wikistats: Handle error due to lack of data - https://phabricator.wikimedia.org/T182224#3817033 (10Milimetric) [03:29:43] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818793 (10Pchelolo) p:05Triage>03Low [03:30:04] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Create custom per-job metric reporters capability - https://phabricator.wikimedia.org/T182274#3818807 (10Pchelolo) [05:10:14] 10Analytics, 10Analytics-Cluster: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3818910 (10EBernhardson) [06:50:26] ebernhardson: really interested in that, please keep me in the loop! :) [08:31:49] !log stop camus on an1003 as prep step for reboot [08:31:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:45:49] Hi elukey [08:45:59] here to help if needed [08:47:57] :) [08:48:12] just released the druid_exporter 0.5 with realtime metrics [08:48:37] Man, this is just awesome :) [08:48:39] and deployed the puppet change to enable middle managers to push metrics (well peons to be precise :) [08:48:57] now I am checking up some metrics about segments since I think there might be a bug [08:49:15] elukey: I've killed the realtime job - I need to deep dive into it more cause it has failed like 5 times yesterday [08:50:28] joal: really interested to figure out what's wrong, lemme know if I can help [08:50:39] ahhh! [08:50:39] druid_historical_segment_used_bytes{datasource="_default_tier"} 31841088057.0 [08:50:43] [..] [08:50:45] druid_historical_segment_used_bytes{datasource="_default_tier"} 96799895856.0 [08:50:48] druid_historical_segment_used_bytes{datasource="_default_tier"} 22594884516.0 [08:50:51] this is definitely not right [08:50:53] uffffff [08:51:25] after the restart of the exporter historical bytes used jumped to 29G [08:51:36] and i was like "what?" [08:54:03] they don't have the datasource [08:54:18] or better, they have it wrong [08:54:48] ah yes because I am stupid [09:10:56] so this time I need to find a proper solution for something that I tried to postpone [09:11:12] namely the fact that metric names are not unique (like segment/used) [09:11:18] hm [09:11:48] sorry joal I was dumping my thoughts without a lot of context [09:11:59] no worries elukey [09:12:04] so segment/used can be emitted by historical and coordinator [09:12:21] but with different dimensions (tier + datasource vs datasource only) [09:12:41] I thought I found a solution to use one data structure for all the metrics [09:12:56] but then I had to do little hacks to preserve this [09:13:03] now I am going to do things properly :) [09:14:10] hi aaaallll! 
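A minimal sketch of how the duplicated-series symptom above can be checked against a running druid_exporter; the host and port (localhost:8000) are assumptions, only the metric name comes from the paste:

    curl -s http://localhost:8000/metrics \
      | grep '^druid_historical_segment_used_bytes' \
      | awk '{print $1}' | sort | uniq -c | sort -rn | head
    # Prometheus text format is "name{labels} value"; a count above 1 here means the
    # same name/label combination is exported more than once, matching the paste above
    # where the tier name ("_default_tier") is ending up in the datasource label.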
[09:14:31] Hi mforns [09:16:27] o/ [09:30:40] joal: from yarn's UI it seems that we only have 3 hive queries left from various users [09:30:51] I'd wait a bit more and then proceed with the reboot [09:30:52] elukey: watching [09:31:15] elukey: given you sent an email yesterday, I say go for it [09:31:28] I know bearloga's one will rerun automatically - for thoers I don't know [09:32:13] maybe I can wait amire80's one to complete (shouldn't take long) [09:32:22] correct [09:32:46] and actually elukey, tilman's one is also well advanced [09:33:21] ok let's wait another say hour [09:33:24] and then decide [09:33:42] elukey: hopefull it'll be faster than thtat [09:34:41] * elukey is always the pessimist [09:34:59] joal: I just want to highlight the fact that this time I've only stopped camus :P [09:35:03] * elukey hides [09:35:55] huhuhu elukey :) [09:38:39] joal: <# [09:38:40] <3 [09:43:02] he he elukey! [09:43:28] just wondering, what time did you steal notebook1002 yesterday? [09:44:17] addshore: still haven't done anything, just copied homedirs over to notebook1001 [09:44:23] addshore: do you use it? [09:44:38] nope [09:45:04] but i noticed a bunch of PAWS edits stop on wikidata @ 23:00 yesterday so just wondered if it had antyhing to do with it :) [09:45:15] maybe a user got missed, but as you havn't touched anything yet i guess now! [09:46:12] nono nothing has been done so far afaik [09:46:44] https://tools.wmflabs.org/sal/production?p=0&q=notebook&d= seems to confirm that [10:11:02] joal: amire80 seems to have fired another query, maybe it is an automatic job? [10:11:29] elukey: I think it is [10:11:37] elukey: tilman's one is finished - Let's go [10:11:40] super [10:11:43] this was the huge one [10:13:09] ah snap one thing might be problematic [10:13:25] The druid clusters are using mysql on analytics1003 [10:13:37] mwarf [10:13:47] the prod one as well I guess [10:13:52] both yes [10:13:55] Marf bis [10:14:04] I think we need to stop them [10:14:49] I am sure that the overlord uses the db [10:15:04] realtime nodes too (but we are good) [10:15:21] so in theory, a brief downtime should not be cause a major issue to Druid [10:15:27] yes - realtime is down for now [10:15:29] the overlord should complain a bit in the logs [10:16:06] but then it should recover once the db is up [10:16:22] we are not indexing anything right now [10:16:52] ok - let's move then [10:18:31] rebooting in a min [10:23:20] so it seems stuck in something like https://www.reddit.com/r/debian/comments/2jyquk/systemd_issue_at_boot_a_start_job_is_running_for/ [10:25:10] moritzm: (if you have time) [10:25:51] after rebooting analytics1003 (a bit important for us) the host is stuck in "Create volatile files and .." [10:25:59] well systemd is stuck [10:26:27] let me have a look [10:27:48] elukey: it's back up [10:28:21] moritzm: so only mentioning your name scared it away? :D [10:28:31] I didn't do anything (when I had logged in via mgmt, I saw the last startup line for oozie and then the console prompt came up [10:28:45] probably yes, it's like saying Candyman a few times in front of a mirror [10:28:53] ahhahha [10:28:55] I [10:29:12] I'll try to see if I can follow up on this issue, I don't like it a lot :) [10:29:13] but I'll have a look whether I can spot something, you can analyse startup times post-boot with systend [10:29:39] thanks a lot! [10:31:21] wow, 13 seconds kernel startup and 6:22 mins (!) 
userspace [10:31:49] yeah :( [10:31:58] 6:10 of that spent in systemd-tmpfiles-setup.service [10:32:53] maybe a bloated /tmp? [10:33:17] mmm doesn't seem so [10:33:59] ah "systemd-tmpfiles creates, deletes, and cleans up volatile and temporary files and directories, based on the configuration file format and location specified in tmpfiles.d(5)." [10:34:34] yeah, but there are none :-) [10:34:43] I'll try running the command manually [10:35:16] ends in a few ms [10:36:24] maybe /tmp was huge due to hadoop garbage and systemd-tmpfiles spent a huge time cleaning? [10:36:47] joal: everything seems fine to me, shall I re-enable camus? [10:39:26] those cleanups should have happened during the system shutdown as part of the reboot alredy [10:39:54] I have no idea, all the tmpfiles configs seem common to all other servers [10:40:19] it's /etc/tmpfiles.d, /run/tmpfiles.d and /usr/lib/tmpfiles.d [10:40:34] maybe one of those was blocked on an hfds mount or so? [10:40:51] no idea [10:41:56] I'll investigate! [10:42:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:42:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:42:56] ssh analytics1003.eqiad.wmnet [10:42:56] System is booting up. See pam_nologin(8) [10:42:57] Authentication failed. [10:43:00] whatttt [10:46:10] joal: this is really bad [10:46:32] O.o [10:47:15] aouch elukey [10:47:23] I can see System is booting up. See pam_nologin(8) [10:51:08] host is booting, I had to powercycle it [10:53:12] elukey: hm - I'll need more explanations on power hardware [10:54:19] so an1003 is back up [10:54:56] elukey: succesfully back up, or in error-mode back up? [10:55:24] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [10:55:44] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:55:56] hm - doesn't smeel good :( [10:56:48] mysql is up, hive doesn't like a lot what happened [10:57:23] I have no idea why that host rebooted [10:57:38] elukey: let me know how you wish to help [10:57:41] 10Analytics-Kanban, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db104[67] - https://phabricator.wikimedia.org/T181784#3819562 (10Marostegui) [10:57:43] log investigation? [10:58:01] joal: now hive seems up [10:58:48] elukey: connection from stat1004 doesn't work (hive) [10:59:09] :Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused) [11:00:05] in theory I see tcp 0 0 0.0.0.0:9978 0.0.0.0:* LISTEN 1412/java [11:00:14] and 1412 is hive-server [11:00:16] several of the system services stopped/went down at 10:50:07 [11:00:44] similar to what would have happened during a controlled reboot (but there was none AFAICT) [11:00:46] moritzm: that one was me powercycling [11:01:11] !log powercycle analytics1003 - no serial console, ssh stuck in System is booting up. See pam_nologin(8) [11:01:26] ah, indeed, missed SAL [11:01:34] elukey: it seems stat1004 still can't connect [11:01:36] and before that the host was inaccessible via standard SSH? 
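Going back to the slow first boot: a sketch of the post-boot timing checks moritzm alludes to, all standard systemd tooling, run on analytics1003 once it is reachable again:

    systemd-analyze time                    # kernel vs userspace split (the 13s vs 6:22 above)
    systemd-analyze blame | head -20        # per-unit cost; where systemd-tmpfiles-setup.service showed its 6:10
    sudo systemd-tmpfiles --create --clean  # re-run the step by hand (it finished in milliseconds above)
    ls /etc/tmpfiles.d /run/tmpfiles.d /usr/lib/tmpfiles.d   # the config locations being compared across hosts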
[11:01:37] joal: ah Connecting to jdbc:hive2://analytics1003.eqiad.wmnet:10000 [11:01:53] moritzm: yeah and Hive was down (server + metastore) [11:03:40] strange [11:03:57] so port 10000 is not up, trying to figure out what daemon is responsible for it [11:04:00] I thought hive-server [11:04:15] in kern.log.1 there's plenty of log messages where CPUs maxed out the temperature threshold, throttling the CPU [11:04:22] maybe some hw/thermal problem? [11:06:33] according to our ferm config 10000 is hive server, yes [11:09:34] Dec 07 11:08:46 analytics1003 hive-server2[7976]: Failed to start Hive Server2. Return value: 1 ... failed [11:09:44] tried to restart it and I got this, nothing relevant on the logs [11:09:53] might be the new prometheus settings [11:10:01] but everything was working fine befor [11:11:40] oh, hive-server really needs a native systemd unit... [11:12:46] FATAL ERROR in native method: processing of -javaagent failed [11:13:40] fails to bind, Address already in use [11:13:57] maybe the prometheus exporter blocks some port which is expected by hive? [11:14:21] in theory no [11:15:14] or the jmx stuff? it's at least part of the traceback (io.prometheus.jmx* below) [11:15:19] ah yes in netstat I can see the ports already used by old processes [11:15:21] wtf [11:16:32] so systemctl restart might not be working correctly [11:16:45] GOTO: "oh, hive-server really needs a native systemd unit..." [11:16:52] that's why I mentioned a proper systemd service unit :-) [11:16:54] exactly :-) [11:17:10] it's reall hard to properly track dependencies with the service units that get derived from an init script [11:17:14] so now hive-server is up [11:17:19] but port 10000 is not [11:18:22] it seems to be using 9978, 46077, 39147 [11:18:39] 46077 and 39147 are probably ephemeral, but is 9978 expected? [11:19:17] can't find it in our puppet/ferm rules at least [11:19:50] so 51010 is the exporter [11:20:13] 9978 is jmx [11:23:11] so at this point I think that hive is not configured anymore for port 10000 [11:23:22] that might be due to the last refactoring, something that I've missed [11:23:55] I think I have a hunch [11:24:12] I ran 'dpkg -L hive-server2' [11:24:29] and it lists a /etc/default/hive-server2 which should be present per package status db [11:24:32] but it's not around [11:24:42] and it's the file which is sourced from the init script [11:24:46] which refers to $PORT [11:24:53] but $PORT isn [11:25:33] and $PORT should probably have been read from /etc/default/hive-server2 [11:25:52] or am I missing something and we intentionally kill this file in puppet? [11:26:20] I don't have all those analytics git submodules checked out I think [11:27:14] mmmm not that I know [11:27:32] so we set profile::hive::client::server_port: 10000 [11:27:40] that is in turn only used in the beeline script [11:28:13] and it was not part of my last refactoring [11:28:20] so this might be a bomb that was waiting me to restart [11:28:27] but that's for hive_client, isn't it? 
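A sketch of the port triage happening here, using the numbers quoted in the conversation (10000 = the Thrift port ferm expects for hive-server, 9978 = JMX, 51010 = the exporter); not the exact commands that were run:

    sudo netstat -tlnp | grep -E ':(10000|9978|51010)\b'          # what is bound, and by which pid
    ps -ef | grep -E 'HiveServer2|HiveMetaStore' | grep -v grep   # stale JVMs that survived "systemctl restart"
    # once 10000 is really bound, an end-to-end check from a client host such as stat1004:
    beeline -u jdbc:hive2://analytics1003.eqiad.wmnet:10000 -e 'SHOW DATABASES;'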
[11:29:32] yeah, I am checking also class { '::cdh::hive::server': } [11:29:47] we rely on having the same config everywhere, that is brought by the client class [11:29:53] (I know naming is a bit weird) [11:31:06] mhh, /etc/default/hive-server2 seems to be a red herring [11:31:24] I downloaded the pristine deb from our mirrors and it only contains [11:31:27] #PORT= [11:31:35] (and some copyright blurb) [11:32:14] we could ofc try to simply add /etc/default/hive-server2 with PORT=10000 and restart hive-server [11:32:18] worth a shot IMO [11:33:16] yeah, going to investigate a bit more and then see [11:34:49] so we use hive-env.sh to set up variables [11:34:51] mmm [11:36:00] I'm not sure if that's effective, doesn't system reset the environment when running a unit? (unless explicitly configured via the systemd directives) [11:37:29] but not sure (especially since the unit is generated from the init.d script) [11:38:23] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/389496 (https://phabricator.wikimedia.org/T178504) (owner: 10Joal) [11:38:47] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/395732 (https://phabricator.wikimedia.org/T178478) (owner: 10Joal) [11:41:03] (03PS2) 10Joal: [FUN] Add performance tests for scala JSON libs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/395018 [11:46:11] (03PS1) 10Joal: Update aqs to bcbdbd3 [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/395981 [11:47:25] (03CR) 10Joal: [V: 032 C: 032] "Self merging for later deploy" [analytics/aqs/deploy] - 10https://gerrit.wikimedia.org/r/395981 (owner: 10Joal) [11:50:03] elukey: should we send an email letting analytics-people know we are experiencing a cluster issue? [11:50:37] elukey: actually, just double checked my emails, and seen you already did - sorry for the disturbance :( [11:52:10] joal: :) [11:52:49] so there must be something that I am missing [11:52:57] I don't find in puppet where we set the 10000 port [11:53:58] ah wait hive.server2.thrift.port – TCP port number to listen on, default 10000. [11:54:46] so it should do it by itself [11:57:44] RECOVERY - Hive Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [11:57:44] also removed the prometheus javaagent config, nothing [11:58:33] hive still doesn't work [12:05:15] what the hell it was the javaagent [12:05:16] RECOVERY - Hive Metastore on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [12:05:55] yeah all good now [12:07:09] oh boy what a reboot [12:07:32] elukey: <3 [12:08:05] elukey: How may I help now? [12:08:28] so for some reason, -javaagent blocks other things. It is like it takes over and does weird things [12:08:54] what fixed it? [12:09:09] elukey: I acually don't know what javaagent does for us [12:09:09] removing the -javagent things for prometheus [12:09:34] it was going out with this reboot [12:09:44] but hive picked it up correctly [12:10:00] so I think that it had the side effect to prevent other things to be set [12:10:47] ok [12:11:21] thanks for the support moritzm [12:15:18] ok joal removed from puppet the java agent config [12:15:28] oozie and hive didn't like prometheus [12:15:37] elukey: can you tell me more about java agent? 
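The confusing part of this failure is that the JMX javaagent kept serving metrics while HiveServer2 never bound its Thrift port; a quick way to make that mismatch visible, with 51010 being the exporter port identified earlier and 10000 the hive.server2.thrift.port default mentioned above:

    curl -s http://analytics1003.eqiad.wmnet:51010/metrics | head -5   # the javaagent exporter answers...
    nc -vz analytics1003.eqiad.wmnet 10000                             # ...while the Thrift port stays closed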
[12:15:38] I will try in labs to reproduce the problem [12:15:59] sure [12:16:11] the prometheus jmx exporter runs as -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=%{::ipaddress}:9183:/etc/hive/prometheus_hive_metastore_jmx_exporter.yaml [12:16:20] (changing the parameters of course) [12:16:39] now in every other daemon, it worked perfectly [12:16:53] druid, analytics100[12] + all worker nodes [12:16:56] cassandra [12:17:04] but with oozie hive is not behaving well [12:17:29] the main issue with oozie was explicit in the error logs, so I was able to figure it out straight away [12:17:54] for hive it was not that easy since the agent was working fine, exposing metrics [12:18:05] and hive* (the processes) were running [12:18:10] but not setting their port [12:19:17] yw [12:19:20] joal: if you are ok I'd re-enable everything [12:19:36] there is already a job running [12:19:58] elukey: let's reenable, I'll keep an eye on jobs [12:21:11] elukey: I know you've been under pressure this morning - Would you still accepting me deploying AQS and possibly cluster later today? [12:22:07] oh yes sure [12:22:15] I am really sorry for the extra trouble [12:22:41] elukey: Don't be sorry - you care our systems - Many thanks for that [12:24:02] !log camus re-enabled after analytics1003 reboot [12:24:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:24:23] so a good mixture of weird things [12:24:26] and nice follow ups [12:24:49] elukey: Actually, I am sorry, for not being able to help in those situations [12:24:54] I am particularly worried about the fact that the hive processes were not dying when doing stop [12:25:12] elukey: yes, silent failures are the worst [12:25:31] nono self caused, I didn't know that port 10000 wasn't set explicitly so I lost time on a false lead [12:38:39] elukey: I suggest I go for he deploy after your lunch :) [12:53:12] 10Analytics-Kanban, 10User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3819866 (10elukey) The last reboot (analytics1003) was particularly painful due to several issues happening in a row. Timeline of events in UTC: * [10:12] Reboot of analy... [12:53:25] joal: timeline in --^ [12:57:19] elukey: good doc, :) [12:57:27] elukey: at least the boot time after your 10:50 reboot was quick; 14s, 55s user space, so maybe this was in fact a case of needing to lots of cleanups for the previous boot [12:58:33] 10Analytics-Kanban, 10User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3819887 (10elukey) Worth to mention, found this while investigating: ``` elukey@analytics1003:/var/log/mylvmbackup$ less analytics-meta.log [...] Can't locate File/Copy/Re... [12:59:06] moritzm: adding it to the timeline :D [13:01:29] all right, going out for lunch! [13:01:32] brb in a bit! [13:01:35] * elukey lunch! [14:10:16] whoa elukey those are old analytics meta backups! [14:10:17] yikes [14:11:35] i'm looking into that now... [14:15:42] ottomata: hiiiii [14:16:26] guess we need some alerts on backup age... [14:16:32] +1 [14:22:57] elukey: is it possible some mariadb/mysql stuff/defaults changed when there was that puppet refactor? 
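A rough sketch of the backup-age alert suggested just above; the path, file pattern and two-day threshold are all assumptions rather than the real layout on analytics1002, and the exit codes follow the usual Icinga/Nagios convention:

    BACKUP_DIR=/srv/backups/mysql/analytics-meta    # hypothetical location
    if find "$BACKUP_DIR" -name '*.gz' -mtime -2 | grep -q .; then
        echo "OK: recent analytics-meta backup found"
        exit 0
    fi
    echo "CRITICAL: no analytics-meta backup newer than 2 days under $BACKUP_DIR"
    exit 2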
[14:23:09] dunno why this perl dep is missing all of the sudden, but there are other problems too [14:24:31] it seems that the last backup was on Sept 22nd though [14:24:39] I did the refactor some days ago [14:24:50] and IIRC nothing really changed when running puppet [14:24:56] but I might have missed something [14:26:28] hm, e.g. the name of the lvm volume is wrong [14:26:40] and mylvmbackup can't find the my.cnf file [14:28:05] that one might be my fault, but I thought it was the was before/after the refactor [14:28:22] what should be there ? And what value we have now? [14:29:24] elukey: https://gerrit.wikimedia.org/r/396010 [14:33:37] good to me [14:36:02] 10Analytics, 10Analytics-Cluster: Alert on age of backups on analytics1002 - https://phabricator.wikimedia.org/T182327#3820110 (10Ottomata) [14:37:21] elukey: did you see: https://gerrit.wikimedia.org/r/#/c/392978/ ? [15:03:11] !log restart webrequest-misc load job (Dec 7 2017 06:00:00) [15:03:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:07:13] elukey: comments responded to, jenkins happy :) [15:11:57] lgtm [15:12:12] :) [15:32:25] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1532172 (10Marostegui) Hi! What's the status of this? [15:34:55] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#1532183 (10elukey) >>! In T108850#3820300, @Marostegui wrote: > Hi! > > What's the status of this? Going to be done before EOQ :) [15:35:23] 10Analytics, 10DBA, 10Patch-For-Review: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3820313 (10Marostegui) Awesome! Thanks! :-) [15:45:20] 10Analytics-Kanban, 10Analytics-Wikistats: Add link to new wikistats 2.0 to wikistats 1.0 pages - https://phabricator.wikimedia.org/T182001#3820356 (10Milimetric) Erik, I just created the talk page and added a welcome message: https://wikitech.wikimedia.org/wiki/Talk:Analytics/Systems/Wikistats Your version o... [15:50:01] 10Analytics-Kanban: Alert on age of backups on analytics1002 - https://phabricator.wikimedia.org/T182327#3820377 (10elukey) p:05Triage>03High [15:57:03] 10Analytics-Kanban, 10Analytics-Wikistats: Add link to new wikistats 2.0 to wikistats 1.0 pages - https://phabricator.wikimedia.org/T182001#3820397 (10Nuria) Nice, thanks for getting this done. [16:00:04] ping joal ottomata [16:01:02] a! [16:01:37] 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Set up auto-purging after 90 days {tick} - https://phabricator.wikimedia.org/T108850#3820420 (10elukey) [16:04:51] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820442 (10Milimetric) Verified - everything has CORS now, thanks to @MusikAnimal for the report. [16:07:56] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820470 (10Pchelolo) 05Open>03Resolved Indeed thanks to @MusikAnimal. Resolving. 
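A spot-check along the lines of what T179113 above verifies: a pageview API request that is bound to 404 (the article title here is made up) should still carry the CORS header:

    curl -sI 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/No_Such_Article_T179113/daily/2017120100/2017120700' \
      | grep -iE '^(HTTP|access-control-allow-origin)'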
[16:12:00] 10Analytics-Kanban, 10Pageviews-API, 10Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3820527 (10mobrovac) [17:22:27] (03PS1) 10Thiemo Mättig (WMDE): Record metrics for Wikidata task priorities (via color) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 [17:26:42] ottomata: just checked the lvm backup, much better now! [17:26:46] \o/ [17:27:48] joal / elukey: aqs deployment is up, checked on a couple hosts [17:27:50] looks good [17:28:31] milimetric: can you let me know when done? I'll heck my metrics :) [17:28:37] joal: done [17:28:44] hehe :) [17:30:02] so joal the new version of the druid_exporter is ready [17:30:07] I'll install it on Monday [17:30:20] ah forgot to tell you guys, tomorrow is bank holiday in italy :) [17:30:23] Super elukey :) [17:34:07] milimetric: I confirm my metrics match ! [17:34:07] This is a MATCH ! [17:34:47] :) yay [17:34:51] we have a match [17:34:52] gogogogo [17:36:19] Man - I'm trying it onsite - looks GOOOOOOd :) [17:39:59] * elukey off! [17:40:01] byyyeee [17:41:52] Bye elukey :) [17:42:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "(I don’t have +2 rights in this repository)" (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [17:42:35] milimetric: Only thing I think misses sa of now in UI is ability to filer by both page and editor type (like, in order to have user on content activity for instance) [17:43:27] joal: yeah, that was on purpose in this level of the UI. And we were thinking of making a huge table that displays and filters/slices all the data as an advanced interface [17:43:39] makes sense milimetric [17:43:55] :) [17:44:03] elukey: how's your ldap foo [17:44:03] just double checked edits metric - it matches wikistats closely :) [17:44:04] ? [17:44:09] oh you are off! [17:44:39] I'm super happy milimetric - Thanks a lot for having dpeloyed :) [17:45:04] psh, joal I did nothing, thanks for the two years of blood sweat and tears [17:46:02] * joal whistle 'spinnin [17:46:02] https://www.youtube.com/watch?v=kK62tfoCmuQ [17:49:13] (03CR) 10Joal: [C: 032] Correct clickstream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/391193 (owner: 10Joal) [17:50:07] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/390226 (https://phabricator.wikimedia.org/T175844) (owner: 10Joal) [17:50:55] milimetric: If you're ok we can move on deploying refinery-source [17:51:05] All the things I wanted are in there [17:51:53] sweet, joal I'll start deploying -source [17:52:26] milimetric: in the meantime I'm checking jar versions where needed, and will provide patches [17:53:12] joal: for refinery, right? So I can deploy that after -source [17:53:38] milimetric: correct - By the way, please wait a few minutes- jenkins is about to merge a patch [17:53:59] joal: k, ping me [17:54:55] (03Merged) 10jenkins-bot: Correct clickstream job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/391193 (owner: 10Joal) [17:55:42] milimetric: --^ done :) [17:55:50] k, resuming [17:56:04] milimetric: you follow the doc I assume, right? 
[17:56:10] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source [17:56:52] super :) [17:57:13] milimetric: a note not to forget changelog.md ;) [17:57:45] I follow directions exactly, because of my dyslexia I end up reading them 30 times :) [17:57:58] huhu :) [18:13:07] (03PS1) 10Joal: Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 [18:13:27] milimetric: --^ this one is the only I need from what I have seen [18:13:59] Gone for diner, back after [18:18:31] (03PS1) 10Milimetric: Update changelog for v0.0.55 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/396074 [18:18:58] (03CR) 10Milimetric: [V: 032 C: 032] Update changelog for v0.0.55 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/396074 (owner: 10Milimetric) [18:26:13] (03CR) 10Milimetric: [C: 032] Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 (owner: 10Joal) [18:29:23] (03CR) 10Milimetric: [V: 032 C: 032] Update jar version in mediawiki-history job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396073 (owner: 10Joal) [18:32:45] 10Analytics, 10Analytics-Cluster, 10Operations: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342#3820871 (10Dzahn) [18:39:21] !log Deployed refinery-source using jenkins [18:39:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:41:20] Joseph, when you're back, all's deployed ^ you can restart your job whenever. [18:47:39] (03CR) 10Addshore: "Yup, right now I think I might be one of the only ones, purely because I am one of the only ones that can test it :/" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [18:47:48] (03CR) 10Addshore: "/ test it on the machine that it runs from" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/396065 (owner: 10Thiemo Mättig (WMDE)) [19:01:29] gotta run to the doc, but I’m getting irc on my phone now, so here if you need me [19:04:19] How bad is https://phabricator.wikimedia.org/T182342? [19:30:13] Awesome milimetric [19:30:33] !log Deploying refinery now that -source is deployed [19:30:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:44:57] heya ottomata - are you here ? [19:45:33] ya [19:45:33] heya [19:45:43] ottomata: I need some help wih scap :( [19:46:11] ottomata: it failed deploying refinery from tin onto stat1005 with a git-fat error [19:46:24] ottomata: do you have a minute for that? [19:47:03] joal sure [19:47:05] i will find the hammer [19:47:12] :D [19:47:20] * joal loves ottomata's ways [19:48:56] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3821071 (10Pchelolo) After a day of running the jobs for wiktionaries I don't see any issues at all, but on the contrary I don't really see any dedupli... [19:58:45] joal ok, mostly fix, but there seems to be an artifact that is not in archiva [19:59:23] hm - really ? That's super weird ! [19:59:24] hmm wait a minute [19:59:29] ottomata: more info on which? [19:59:58] trying to find out but.... hmmm git-fat still didn't work right [20:01:57] joal they are all your new .55 files [20:02:48] ottomata: ??? 
[20:02:52] artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar:#$# git-fat da39a3ee5e6b4b0d3255bfef95601890afd80709 0 [20:02:53] but also [20:03:00] all your 0.55 files have the same sha [20:03:06] core, cassandra, job, etc. [20:03:08] all the same sha [20:03:23] hm - milimetric deployed that, probably through regular jenkins I imagine [20:05:17] ok joal the files are in archiva [20:05:17] I did, and I had to do that extra build in jenkins because the first try still had the v54 in the Release Version box [20:05:17] its just that refinery/artifacts are wrong [20:05:17] so, you can probably remote your .55 jars [20:05:17] and re-add them manually (with git fat) and comit [20:05:22] weird, I followed the instructions exactly [20:05:44] ottomata: How do you think we should move now? Shall I ry to redeploy v0.0.56? [20:05:56] archive and the 0.55 jars are fine [20:06:04] its just refinery/artifacts that have the wrong git fat shas [20:06:09] if you DL the 0.55 jars [20:06:18] and git add them (with a properly git fat inited repo) [20:06:24] ottomata: Maybe we can try the jenkins way anew? [20:06:27] you should be able to make a manual commit that properly adds [20:06:34] joal: sure, but i dont' think yo need to make a new release [20:06:36] but up to you [20:06:41] I'll try that first [20:06:49] can you just do the refinery add files step? [20:06:53] and if not success I'll go for manual [20:07:19] ok for you ottomata? [20:07:48] k [20:07:49] ya [20:07:49] sure [20:08:40] my local repo didn’t have git fat because it’s my new computer, but I didn’t think that would matter since I did everything on jenkins [20:08:44] git log [20:08:47] oops [20:09:39] milimetric: from commit message, I can tell that there have been an error in deploy conf for linking jars into refinery [20:09:39] I had not noticed it, but now I see --> Add refinery-source jars for vv0.0.55 to artifacts [20:10:18] double v -- I hink you've written v0.0.55 in the release-version field of jenkins, while it was waiting for just the numbers [20:10:38] Oh! I must have. Ok, I’ll edit and add a note about that [20:10:51] This sequence of action for deploy is very misleading - in the previous jenkins box, you need to put v0.0.X, and in that one, just 0.0.X [20:11:02] We should actually patch he process [20:11:24] milimetric: I recall having been there before, and told myself how conter-intuitive this hing is [20:12:27] !log Trying to deploy refinery again [20:12:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:12:57] joal: I think I noticed it before and avoided the mistake, but I talked to madhu and she said it wasn’t possible to change somehow [20:13:14] for now, I’ll update the docs [20:13:20] Many thanks :) [20:14:01] ottomata: we still have the same issue - I'll remove the wrong jars (adding new ones is not enough, obviously -- facepalm) [20:15:38] ayeee ok cool [20:16:00] (03PS1) 10Joal: Remove wrong artifacts from directory [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396102 [20:16:29] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/396102 (owner: 10Joal) [20:20:29] Success ! Thanks a lot ottomata and milimetric :) [20:20:39] phew, great! [20:22:39] ottomata: o/ - forgot to ask if we need to keep kafka10* with puppet disabled [20:23:07] oh! no [20:23:15] that was an oversight on my part [20:23:16] enabling [20:26:07] ottomata: any info on the deployment schedule discussed in ops? 
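For reference, the stub quoted at the top of this exchange ("#$# git-fat da39a3ee... 0") is git-fat's placeholder for an empty payload: da39a3ee5e6b4b0d3255bfef95601890afd80709 is the SHA-1 of an empty stream and the trailing 0 is the size. The manual fix ottomata sketches (the route actually taken was to drop the bad stubs and re-add the jars through the usual add-files step) would look roughly like this, assuming the 0.0.55 jar has already been downloaded from archiva into the home directory:

    cd analytics/refinery
    git fat init                                   # make sure the clean/smudge filters are active
    cp ~/refinery-hive-0.0.55.jar artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar
    git add artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar
    git show :artifacts/org/wikimedia/analytics/refinery/refinery-hive-v0.0.55.jar   # staged stub should now carry a real sha and size
    git commit -m 'Re-add refinery-source v0.0.55 artifacts with git-fat'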
[20:26:56] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821198 (10Dzahn) Is there a ticket to get eventlog2001 back into production? It is in site.pp but doesn't have any roles. Adding it with ro... [20:28:02] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#1840331 (10Ottomata) > Is there a ticket to get eventlog2001 back into production? It never was in production. [20:28:22] 10Analytics, 10Analytics-EventLogging, 10Icinga, 10Operations: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#3821203 (10Dzahn) So it should be decom'ed? [20:31:43] 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: Enable more accurate smaps based RSS tracking by yarn nodemanager - https://phabricator.wikimedia.org/T182276#3821214 (10EBernhardson) I suppose for a little more background on what i think is happening: * The executors that die seem to be the ones that... [20:44:35] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821234 (10mpopov) p:05Triage>03Normal [20:45:21] !log Kill restbase oozie job and restart apis replacing one [20:45:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:48:31] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821256 (10mpopov) [20:52:32] joal: still here? [20:52:38] yup [20:52:43] whassup ottomata ? [20:52:46] can you test something for me? [20:53:12] sure [20:53:18] edit your /etc/hosts and set an alias for 127.0.0.1 superset.wikimedia.org [20:53:19] then [20:53:22] ssh -N thorium.eqiad.wmnet -L 9081:thorium.eqiad.wmnet:80 [20:53:31] then [20:53:42] http://superset.wikimedia.org:9081 [20:53:48] i want to see if you can authenticate with your ldap username and pw [20:53:56] and then also if i can promote you to an admin user [20:55:26] ottomata: ERR_SSL_PROTOCOL_ERROR [20:55:44] joal i got that too in chrome, dunno why its redirecting to ssl [20:55:51] you in chrome? [20:56:25] safari worked for me (can't remember if you are on osx or not) [20:56:51] yes - just tried in ff - failed with same [20:56:59] nope [20:57:06] debian [20:57:43] huh weird [20:57:48] dunno where that redirect comes from [20:58:08] ok, i'll have to get the full lvs stuff up first then [20:58:58] sorry ottomata - trying to look into network inside chrome [21:00:21] np [21:00:23] no worries [21:00:33] it can wait til tomorrow sometime [21:00:52] I actually can't see what is redirecting me [21:00:57] we'll see tomorrow [21:02:26] milimetric: refinery is deployed - I'll move jobs to prod (changing table, new one, new jobs etc) tomorrow my morning [21:02:30] milimetric: ok for you ? [21:02:37] Like that I'll have time to fix :) [21:03:29] great, ok with me joal [21:03:56] that way people have time to respond to the email [21:04:35] true :) [21:09:52] !log Start clickstream oozie job [21:09:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:13:59] Gone for tonight a-team [21:27:55] joal: laters! if you want to try, its up! 
[21:28:01] https://superset.wikimedia.org [21:56:26] 10Analytics, 10Discovery, 10Discovery-Analysis, 10Discovery-Search: UDF for language detection - https://phabricator.wikimedia.org/T182352#3821459 (10TJones) So, I think this is a very nifty idea, but there are some potential pitfalls to be aware of. The claim (copied in the Java port from the original) t... [22:51:45] 10Analytics, 10Discovery, 10EventBus, 10Wikidata, and 4 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3821621 (10Pchelolo) Hm, actually if I just try to consume from that topic (any topic actually) with `-F "%T"` that should give me message timestamps i... [23:19:37] 10Analytics-Kanban, 10Patch-For-Review, 10Services (watching): Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3821743 (10Pchelolo) I guess that has been done since I was able to add the Action API graph to the API summary dashboard: https://grafana.wikimedia.org/da... [23:23:34] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3821751 (10Pchelolo) [23:23:35] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 4 others: Add the ability to sign and verify jobs - https://phabricator.wikimedia.org/T174600#3821748 (10Pchelolo) 05Open>03Resolved The signing/verification has been implemented. Resolving. [23:26:58] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make Kafka JobQueue use Special:RunSingleJob - https://phabricator.wikimedia.org/T182372#3821757 (10Pchelolo) p:05Triage>03Normal [23:28:50] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088#3821776 (10Pchelolo) [23:28:53] 10Analytics, 10ChangeProp, 10EventBus, 10Services (done): Support topic arrays in ChangeProp config - https://phabricator.wikimedia.org/T175727#3821773 (10Pchelolo) 05Open>03declined Since we've moved the job specifications into vars.yaml this is no longer required. [23:41:03] 10Analytics, 10Discovery, 10EventBus, 10Wikidata, and 4 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3821787 (10Nuria) I got same doing: /home/otto/kafkacat -Q -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create:0:1512687299 -Xdebug=al...
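For the timestamp/offset discussion at the end: a hedged sketch of the two kafkacat modes involved, with broker and topic taken from the command above. Note that -Q resolves a timestamp to an offset and expects epoch milliseconds, while the value above (1512687299) looks like epoch seconds:

    # print a few messages with their broker-assigned timestamps (ms), offsets and sizes
    kafkacat -C -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create -p 0 \
             -o beginning -c 3 -f 'ts=%T offset=%o size=%S\n'
    # query mode: offset of the first message at or after the given timestamp
    kafkacat -Q -b kafka-jumbo1003.eqiad.wmnet -t eqiad.mediawiki.revision-create:0:1512687299000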