[06:20:12] morning! [06:24:34] !log stop camus crons on an1003 and report updater on stat1005 as prep step for cluster shutdown [06:24:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:31:18] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959 (10elukey) >>! In T192959#4613122, @Nuria wrote: > I think this work is completed , ping @JAllemandou for confirmation As far as I know this needs us to experime... [07:07:01] Morning elukey - Thanks for early stop of crons :) [07:07:59] morning! [07:21:32] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959 (10JAllemandou) The implemented solution is not a real one: it's an oozie check preventing running indexations on production datasources when user is not hdfs. I t... [07:24:34] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959 (10elukey) [07:35:28] 10Analytics, 10Analytics-Cluster, 10Contributors-Analysis, 10Product-Analytics: Attempting to select all columns of mediawiki_history sometimes fails with a cryptic error message - https://phabricator.wikimedia.org/T205367 (10JAllemandou) Hi @Neil_P._Quinn_WMF , The error in the first query comes from map... [07:49:27] elukey: One single discovery-prod spark job running [07:50:14] yep, seems almost done, we might be in time for 10 CEST :) [07:51:50] elukey: actually very difficult to know - The oozie launcher tells us nothing on how much is left [07:51:59] neither does spark actually [07:53:01] the progress for the app master seems to indicate that it is almost complete, no? [07:53:36] on the oozie launcher?
[07:53:52] yep [07:54:27] The advancement bar on the oozie launcher does not reflect real-time progress [07:54:52] ah ok didn't know that, good :) [07:55:39] we can possibly ask dcausse ? :) [07:55:42] Any oozie launcher spends most of its life with an advancement bar at that exact level - Meaning most of its job is done (it has started its child job and waits for it to finish), but you never know how much of the child job is still left [07:56:18] reading backscroll [07:56:35] dcausse: we wonder about the Spark transfer job [07:56:39] oh one of our jobs is still running :/ [07:57:04] dcausse: do you have an example of another similar job (couldn't find one in hadoop history :( [07:57:13] looking [07:58:35] 10Analytics, 10Patch-For-Review: Review Bacula home backups set for stat100[56] - https://phabricator.wikimedia.org/T201165 (10akosiaris) [07:58:37] joal: ack I didn't know it [07:59:02] it takes 3 hours usually [07:59:12] elukey: When looking at oozie, find the launcher, then find the child ;) [07:59:24] sure I thought it was the child [07:59:26] err [07:59:32] dcausse: weird [07:59:49] no [07:59:56] I'm completely wrong :) [08:00:51] Mon, 17 Sep 2018 13:36:59 => Tue, 18 Sep 2018 16:39:26 [08:00:53] more than 24h [08:01:03] feel free to kill it [08:01:12] elukey: ^ [08:01:17] ok let's do that - sorry for the job dcausse :( [08:01:21] np [08:01:25] thanks :) [08:01:30] elukey: killing the job [08:01:51] ack [08:02:20] !log Killing discovery transfer job to drain cluster before master replacement (application_1536592725821_38136) [08:02:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:02:37] elukey: We're good [08:03:03] elukey: only notebook kernels left, they'll restart when needed [08:03:54] yep it seems the spark shells are not doing anything?
(probably) [08:04:23] elukey: in any case, notebook shells are easy to respawn [08:04:48] silencing alarms and disabling puppet [08:05:39] let's remember that analytics1068 is still down for hw failure [08:05:48] k elukey [08:09:12] done [08:09:50] going to stop hue, hive and oozie [08:09:59] following you elukey [08:10:12] elukey: let me know if you want to have an eye on something precise [08:11:24] joal: if you can follow https://etherpad.wikimedia.org/p/analytics-swap-masters and make sure that I don't skip anything it would be good :D [08:11:43] umount /mnt/hdfs now [08:11:57] elukey: on it ! [08:12:22] elukey: shall I email about maintenance starting? [08:12:57] I sent an email yesterday as a reminder, I think that we are good [08:13:02] k [08:13:18] in terms of order, not sure - Puppet still to be disabled, right? [08:13:24] ok now I am going to enter safemode on an1001 -> sudo -u hdfs hdfs dfsadmin -safemode enter [08:13:54] let's also keep an eye on time, I silenced alarms for 1h [08:13:59] elukey: --^? [08:14:02] ok [08:14:16] joal: puppet already disabled [08:14:34] k :) No log in chan (or missed it), so I prefer to check :) [08:15:21] please feel free to do it anytime, two checks are better than 1 (or none :P) [08:15:35] joal: ack to enter safemode? [08:16:05] yes [08:16:12] oh no- [08:16:16] elukey: reportupdater? [08:16:37] disabled the cron on stat1005 [08:16:41] k [08:16:43] then yes [08:16:52] on stat1006 as well? [08:16:57] elukey@analytics1001:~$ sudo -u hdfs hdfs dfsadmin -safemode enter [08:16:58] Safe mode is ON in analytics1001.eqiad.wmnet/10.64.36.118:8020 [08:16:58] Safe mode is ON in analytics1002.eqiad.wmnet/10.64.53.21:8020 [08:17:08] stat1006 doesn't have access to HDFS no?
[08:17:15] hm [08:17:23] Ah right, only mysql [08:17:23] only stat1005/6 [08:17:24] my bad [08:17:30] err stat1004/5 [08:17:52] saving namespace [08:19:18] elukey@analytics1001:~$ sudo -u hdfs hdfs dfsadmin -saveNamespace [08:19:18] Save namespace successful for analytics1001.eqiad.wmnet/10.64.36.118:8020 [08:19:21] Save namespace successful for analytics1002.eqiad.wmnet/10.64.53.21:8020 [08:19:24] joal: --^ [08:19:28] great :) [08:19:40] now gently stopping the hdfs datanodes [08:20:08] elukey: shouldn't we stop yarn at the same time (or even first)? [08:20:55] joal: well nothing is running on hdfs and safemode is on, I think the order is ok but we can definitely flip the shutdown [08:21:06] starting with yarn then [08:21:39] elukey: true, but still functionally better to stop yarn first ... Sorry to disturb for nothing :S [08:22:04] nono please don't say sorry and keep telling me your thoughts, it is appreciated :) [08:22:48] elukey: I explicitly killed the notebook shells - clean state [08:23:13] super [08:23:46] all right yarn node managers stopped [08:23:59] lemme check the jvms still running as yarn [08:24:26] elukey: interesting !!! ResourceManager still sees everybody up :D [08:24:35] zero containers [08:24:44] \o/ [08:24:59] shutting down yarn resource managers on an1001/2 [08:25:04] ack [08:25:43] I confirm UI is gone [08:26:13] also stopped the mapred history server on an1001 [08:26:19] now, datanodes? [08:26:22] good one [08:26:24] yes [08:27:45] and journal nodes [08:27:56] namenode first? [08:28:01] ah no wait, might be better namenodes first [08:28:04] yep :) [08:28:09] :) [08:29:40] no jvms running on an100[1,2] [08:29:45] now we can stop journalnodes [08:29:51] ok ! [08:30:18] done! [08:30:43] etherpad on track :) [08:31:14] by the way elukey - Have you double checked the new masters are in the analytics VLAN?
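The shutdown order the two settle on above (jobs drained, safemode on, namespace saved, then YARN before HDFS, with journal nodes last) can be summarized as a dry-run sketch. The systemd unit names are the standard Hadoop package ones and are an assumption here (only the master-side units appear verbatim later in the log); the script only prints the commands, it executes nothing:

```python
# Dry-run sketch of the cluster-drain order agreed in the conversation above.
# Assumption: standard Hadoop unit names; nothing is executed, only printed.
DRAIN_ORDER = [
    "sudo -u hdfs hdfs dfsadmin -safemode enter",   # freeze HDFS writes
    "sudo -u hdfs hdfs dfsadmin -saveNamespace",    # persist the namespace
]
# YARN first (functionally better, as joal notes), then HDFS, journal nodes last:
for unit in ["hadoop-yarn-nodemanager", "hadoop-yarn-resourcemanager",
             "hadoop-mapreduce-historyserver", "hadoop-hdfs-datanode",
             "hadoop-hdfs-namenode", "hadoop-hdfs-journalnode"]:
    DRAIN_ORDER.append("sudo systemctl stop %s.service" % unit)

print("\n".join(DRAIN_ORDER))
```

The point of keeping it as a printed checklist is that it can be pasted into the etherpad for the next swap, as requested later in the log.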
[08:31:42] I did yes [08:31:48] Great :) [08:32:04] ok no more jvms running [08:32:44] elukey@analytics1001:~$ sudo systemctl mask hadoop-hdfs-zkfc.service [08:32:47] Created symlink from /etc/systemd/system/hadoop-hdfs-zkfc.service to /dev/null. [08:32:50] elukey@analytics1001:~$ sudo systemctl mask hadoop-hdfs-namenode.service [08:32:53] Created symlink from /etc/systemd/system/hadoop-hdfs-namenode.service to /dev/null. [08:32:56] elukey@analytics1001:~$ sudo systemctl mask hadoop-yarn-resourcemanager.service [08:32:59] Created symlink from /etc/systemd/system/hadoop-yarn-resourcemanager.service to /dev/null. [08:33:02] elukey@analytics1001:~$ sudo systemctl mask hadoop-mapreduce-historyserver.service [08:33:05] Created symlink from /etc/systemd/system/hadoop-mapreduce-historyserver.service to /dev/null. [08:33:10] elukey: can you copy the commands in the etherpad (for next time)? [08:33:15] Actually, doing it [08:33:26] and [08:33:27] elukey@analytics1002:~$ sudo systemctl mask hadoop-hdfs-zkfc.service [08:33:31] Created symlink from /etc/systemd/system/hadoop-hdfs-zkfc.service to /dev/null. [08:33:34] elukey@analytics1002:~$ sudo systemctl mask hadoop-hdfs-namenode.service [08:33:37] Created symlink from /etc/systemd/system/hadoop-hdfs-namenode.service to /dev/null. [08:33:40] elukey@analytics1002:~$ sudo systemctl mask hadoop-yarn-resourcemanager.service [08:33:43] Created symlink from /etc/systemd/system/hadoop-yarn-resourcemanager.service to /dev/null. [08:33:46] sure I can do it [08:35:05] backups? [08:35:24] doing it now [08:35:35] (mysql first) [08:39:48] About 1/2h before icinga starts ringing again [08:42:35] ack [08:42:42] copying the namenode dirs to stat1006 now [08:42:45] (via netcat) [08:49:11] joal: done! (also checked via sha256sum) [08:49:36] Good - Puppet time I guess :) [08:50:17] yep!
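The copy-then-verify step above (netcat transfer of the namenode dirs, checked with sha256sum on both ends) boils down to comparing streamed digests; a minimal self-contained Python sketch of that check (hypothetical paths, for illustration only, not the actual procedure used) is:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through sha256 so a large fsimage need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(src, dst):
    """True when the source file and its copy have identical digests."""
    return sha256_of(src) == sha256_of(dst)
```

Comparing the full hex digests programmatically avoids the eyeball check mentioned later in the log ("beginning and end match (at least 10 chars)").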
[08:53:59] change merged, puppet ran on conf1004->6 [08:54:56] Arf, can't join an-master1001 yet :) [08:55:11] I haven't run puppet on those hosts [08:55:16] k [08:55:29] so just confirmed that I can telnet to all the conf100[4-6] hosts from both new masters [08:55:34] firewall rules are ok [08:55:35] elukey: I don't understand the line 146 of etherpad then [08:55:42] k [08:56:00] elukey: means that puppet config is ok, but not yet applied? [08:56:04] yes [08:59:02] I am copying the namenode dirs now [08:59:11] 1001 -> an-master1001 [08:59:15] 1002 -> an-master1002 [08:59:19] yup [09:02:07] extended downtime on analytics* [09:02:15] Thanks :) [09:06:15] also I am doing things slowly to copy etc.. to make sure that things are ok [09:06:23] sounds good elukey :) [09:06:24] sha256sum are a big slow [09:06:27] *bit [09:06:37] but yeah we are not in a hurry :) [09:06:40] sha-checking is definitely safer :) [09:07:44] joal: can you double check the sha? [09:07:47] I think they are ok [09:09:02] PROBLEM - Hue Server on analytics-tool1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [09:09:04] elukey: didn't check all the numbers, but beginning and end match (at least 10 chars) - Works for me :) [09:09:26] Ahhh ! downtime on an-tools ! [09:09:39] no big issue [09:09:46] all right zookeeper cleaned [09:10:07] I think that we are ready to start the journal nodes [09:10:32] hm - puppet first, then start journal only? [09:11:13] Ah no, mask, then puppet [09:11:15] mask the datanode/nodemanager, run puppet and check journal [09:11:17] exactly [09:11:21] +1 [09:11:55] ok starting with 1028 [09:21:25] joal: all journal nodes up [09:21:36] K ! [09:21:46] Masters time [09:23:11] ah snap I found an issue that I didn't notice until now [09:23:20] ?
[09:23:20] should be easy to fix [09:23:38] so in the partition scheme on an-master1001 the /srv directory is using LVM [09:23:41] not /var/lib/hadoop [09:23:59] need to delete the /srv one and create a new one [09:24:06] elukey: new config from partman being better? [09:25:02] nope it is not correct, since all the ops misc servers assume that /srv is where important things will be [09:25:12] but on an-master1001 it is not the case, it is /var/lib/hadoop [09:25:18] ok [09:25:28] so it actually needs a manual change for that [09:26:41] yep doing it [09:33:35] /dev/mapper/an--master1001--vg-lvol0 173G 61M 173G 1% /var/lib/hadoop/name [09:33:38] better now [09:34:04] doing the same on an-master1002 [09:34:06] elukey: is the /name normal? [09:34:59] 10Analytics, 10Analytics-Wikistats: Wikispecial wikis WikiStats Zeitgeist should include talk - https://phabricator.wikimedia.org/T37195 (10Aklapper) @ezachte: Do you (still) work (or plan to work) on this issue? If you do not plan to work on this issue anymore, please remove yourself as assignee (via {nav nam... [09:35:01] 10Analytics, 10Analytics-Wikistats: WikiStats should recognize global bots - https://phabricator.wikimedia.org/T37196 (10Aklapper) @ezachte: Do you (still) work (or plan to work) on this issue? If you do not plan to work on this issue anymore, please remove yourself as assignee (via {nav name=Add Action... > A... [09:35:04] 10Analytics, 10Analytics-Wikistats: Add a "number of articles with interwiki links" column in WikiStats tables - https://phabricator.wikimedia.org/T37197 (10Aklapper) @ezachte: Do you (still) work (or plan to work) on this issue? If you do not plan to work on this issue anymore, please remove yourself as assig... [09:36:25] joal: it is the same on the current masters [09:36:26] why?
[09:36:48] I wondered - The /name sounds bizarre to me, but heh, if it's the way :) [09:39:47] the /extra dir doesn't contain anything basically [09:40:14] I am currently fixing some partitions on an-master1002, we also have backup in there for the an1003's db that needs space [09:40:45] elukey: should we add those steps to the etherpad? [09:43:22] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Wikidata, and 2 others: Query stats dashboard not updating - https://phabricator.wikimedia.org/T204415 (10Addshore) WMDE wants to pull some data from this dashboard for some internal reporting (data needed for september). Any ETA on the fix and back... [09:43:33] there you go [09:43:33] /dev/mapper/an--master1002--vg-namenode 35G 49M 35G 1% /var/lib/hadoop/name [09:43:36] /dev/mapper/an--master1002--vg-backup 138G 61M 138G 1% /srv [09:43:39] joal: in theory no [09:44:04] I should have done it yesterday but didn't notice [09:44:05] :( [09:44:10] anyhowww [09:44:16] we are good now [09:44:19] ok :) [09:44:48] so puppet on masters it is? [09:46:19] I just copied the namenode dir files [09:46:22] now I am going to run puppet [09:47:48] joal: if you have time, batcave? [09:47:57] elukey: OMW ! [09:55:10] hellooooo team [09:55:35] Hi mforns [09:55:40] hey joal :] [09:55:44] I see hue is down [09:56:26] will read docs and see what I can do [09:57:33] mforns: this is elukey and me restar [09:57:47] oh!
ok [09:57:48] mforns: swapping cluster master-nodes [09:57:54] k k thanks [09:58:01] mforns: np :) [10:03:53] PROBLEM - Hadoop HDFS Zookeeper failover controller on an-master1001 is CRITICAL: NRPE: Command check_hadoop-hdfs-zkfc not defined [10:04:53] RECOVERY - Hadoop HDFS Zookeeper failover controller on an-master1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController [10:29:53] RECOVERY - Hue Server on analytics-tool1001 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [10:39:23] PROBLEM - Age of most recent Analytics meta MySQL database backup files on an-master1002 is CRITICAL: CRITICAL: 0/1 -- /srv/backup/mysql/analytics-meta: No files [10:40:52] this is expected --^ [10:41:13] PROBLEM - Age of most recent Hadoop NameNode backup files on an-master1002 is CRITICAL: CRITICAL: 0/1 -- /srv/backup/hadoop/namenode: No files [10:51:09] (03CR) 10Mforns: Add CitationUsage and CitationUsagePageLoad to EL whitelist (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/462521 (https://phabricator.wikimedia.org/T205272) (owner: 10Bmansurov) [10:59:03] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [10:59:04] PROBLEM - Hadoop Namenode - Stand By on analytics1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [10:59:10] PROBLEM - At least one Hadoop HDFS NameNode is active on analytics1001 is CRITICAL: Hadoop Active NameNode CRITICAL: no namenodes are active [10:59:22] PROBLEM - Hadoop ResourceManager on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [10:59:24] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name
java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer [10:59:32] PROBLEM - Hive Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 [10:59:43] PROBLEM - Oozie Server on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap [10:59:52] PROBLEM - Hive Metastore on analytics1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [11:00:10] expired downtime [11:00:12] sorryyy [11:03:47] Yay ! [11:03:49] Back online [11:04:09] elukey: sorry for that interruption [11:04:13] RECOVERY - Hive Metastore on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore [11:04:15] elukey: How may I help? [11:04:50] joal: so I tested the mapreduce job listed in the etherpad, and now I have manually updated the hdfs/yarn config on an1003 and I am starting hive [11:04:53] so we can test it [11:05:00] (without re-enabling crons) [11:05:03] RECOVERY - Hive Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 [11:05:45] ok [11:06:05] already up, you can test it [11:06:12] doi [11:07:10] spark2-shell seems to be working fine [11:07:36] hive job running, metastore ok, I think we're good [11:07:53] \o/ [11:08:43] man - testing stuff: hive --> 1min18sec, spark: 2secs !@! [11:09:12] lol [11:09:47] Hey all, I'm deploying https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/462042/ [11:10:16] this will increase the amount of analytics events for the ReadingDepth schema, we're bumping from 0.001 to 0.1 [11:10:19] https://phabricator.wikimedia.org/T205176 [11:10:19] Thanks raynor for letting us know [11:11:32] elukey: oozie time? [11:12:08] sure!
[11:12:49] started [11:13:02] RECOVERY - Oozie Server on analytics1003 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap [11:14:56] elukey: can you please restart hue? (warning in UI asking for a restart) [11:17:20] I can, sure, let's see if it fixes it [11:18:56] so the first two should be ok (we have set the smtp to localhost some days ago) [11:18:59] last one is a bit weird [11:20:13] elukey: could be related to HDFS not being available on an-tools? [11:20:55] joal, change is live [11:21:25] k raynor - shouldn't have any impact on our side (bigger data flowing in is all) [11:21:52] elukey: hive through Hue seems ok [11:21:55] checking oozie [11:22:38] yes, just a bigger data flow, we were afraid that this might overload our analytics cluster [11:22:41] so far so good elukey [11:22:45] ack [11:23:04] raynor: for mysql, it surely would have :) [11:23:11] I remember some time ago we started with 0.05 but because of load issues with MariaDB we dropped to 0.001 [11:23:46] raynor: with mariadb blacklist, I'm confident we should be ok [11:24:10] elukey: I think we're ok on everything - let's reenable a cron? [11:24:20] or maybe all crons even elukey [11:24:45] dcausse: shall I restart the job we killed? [11:25:46] joal: any idea about https://phabricator.wikimedia.org/T193641#4614082 ? [11:25:47] :( [11:25:49] joal: sure, going to re-enable them [11:26:16] raynor: your deploy indeed makes some change: https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&var-schema=ReadingDepth&from=now-3h&to=now [11:26:19] looking addshore [11:26:24] thanks! [11:26:27] addshore: cache misc is gone, you'd need to use cache text now [11:26:32] (so webrequest text) [11:26:39] :( [11:28:09] addshore, elukey: unrelated to misc I think [11:28:49] oh?
[11:29:03] elukey: this job is about editors, not visitors [11:29:15] sorry addshore I misread then [11:30:10] addshore: job seems successfully running - something else must be going on [11:32:51] elukey: camus and refine started on cluster [11:33:25] joal: there are a couple of webrequest jobs failed [11:33:58] Arf - Waiting time too long I guess [11:34:00] hm [11:34:24] it should be ok to just restart [11:34:52] elukey: Let's wait for camus to have caught up some delay, then we'll restart them [11:38:13] joal: thanks! [11:38:53] RECOVERY - Age of most recent Hadoop NameNode backup files on an-master1002 is OK: OK: 1/1 -- /srv/backup/hadoop/namenode: 0hrs [11:39:05] Thanks HDFS --^ [11:39:30] well it was me running /usr/bin/hdfs dfsadmin -fetchImage /srv/backup/hadoop/namenode [11:39:33] manually :P [11:39:45] Ah - Thanks elukey - HDFS, you cheater [11:43:22] addshore: just triple checked the query on Spark - Looks correct - The issue is somewhere else (maybe oozie config) [11:45:35] And by the way addshore, we could easily backfill the data (didn't think about it before) [11:45:43] yuppp :) [11:49:03] addshore: Nothing spotted in conf :( [11:49:20] addshore: will try to rerun the job manually now, and check [11:49:25] thanks! [11:50:59] addshore: I have an idea [11:51:12] addshore: Arf, not really [11:51:15] hm [11:53:32] RECOVERY - Age of most recent Analytics meta MySQL database backup files on an-master1002 is OK: OK: 1/1 -- /srv/backup/mysql/analytics-meta: 0hrs [11:53:39] joal: it'll restart next week on its own, no big deal. 
It's just to update pageviews no data will be lost [11:53:52] !log rerun as you prefer dcausse :) [11:53:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:53:55] ooops [11:54:01] yesssss (RECOVERY) [11:54:02] :) [11:54:03] You'll be in our logs dcausse :) [11:54:19] so proud :) [11:54:26] dcausse is always in our thoughts of course [11:55:03] ok joal, two stupid alarms remaining (nothing to worry about), but I'd say that we did it :) [11:55:06] \o/ [11:55:26] !log Rerun webrequest-load-wf-upload-2018-9-25-6 after failed SLA during hadoop master swap [11:55:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:55:29] \o/ !! [11:55:40] elukey: Thanks mate for handling the whole thing :) [11:56:07] thanks for reviewing and checking all the procedure! [12:01:18] (03CR) 10Fdans: "thank you for this @Joal <3" [analytics/aqs] - 10https://gerrit.wikimedia.org/r/461666 (https://phabricator.wikimedia.org/T204707) (owner: 10Joal) [12:02:17] going to eat something! [12:10:29] addshore: I have no explanation, but rerunning the job makes the data appear [12:10:41] :D [12:11:12] addshore: Currently rerunning it for july, and data will be homogeneous [12:11:17] addshore: weird style [12:11:39] weird style ? [12:11:44] This job [12:15:32] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5037 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:16:36] hm [12:18:53] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5015 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:19:21] ah! [12:19:40] this should be related to raynor change? 
[12:19:46] elukey: I think it is [12:19:59] elukey: EL dashboard is happy, mysql consumer seems stable [12:20:07] however global rate is high [12:21:12] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5050 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:21:21] joal: https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=5&fullscreen&orgId=1&from=1537872950397&to=1537877691855 [12:21:30] I am trying to understand if this is expected or not [12:21:46] hm - 2 spikes in 2 days - CitationUsage yesterday, ReadingDepth today [12:21:57] elukey: I suggest we add workers [12:22:36] joal: not that easy, we need to increase the kafka partition count first [12:22:58] Really ? Wouldn't they share partitions? [12:23:03] yes, it can [12:23:55] I am talking about the eventlogging-client-mixed topic [12:23:58] some time ago events/s spiked to 1.3k (previously it was <10) [12:23:59] where all events end up [12:24:09] it currently has 12 partitions [12:24:25] so maximum of 12 workers [12:24:27] avg ~700 events per second [12:24:49] raynor: and this is only 0.1%? [12:24:49] elukey: I thought we could have multiple workers per partition ... [12:25:02] yes, it's only 0.1% [12:25:07] woa [12:25:13] let me double check [12:25:14] are you guys planning to increase it?
(03CR) 10Bmansurov: Add CitationUsage and CitationUsagePageLoad to EL whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/462521 (https://phabricator.wikimedia.org/T205272) (owner: 10Bmansurov) [12:25:37] raynor: I'm assuming the data means multiple events per pageview [12:26:09] (03PS1) 10Fdans: Add moving average as a dataset operation [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/462688 [12:26:11] it' [12:26:21] joal: sorry the topic is eventlogging-valid-mixed, but yes it holds only 12 partitions [12:26:27] right [12:26:30] it's reading depth - how far you read, I'm not sure if this is multiple events per page, I can check it, give me some time [12:27:07] elukey: I had that false idea that we could put more than one consumer per partition - Kafka messed up my mind :) [12:27:57] elukey: on eventlog1002, load rises, but memory and disk seem stable [12:28:01] joal: we can but the "extra" ones will stay in standby, taking over if one fails in the consumer group [12:28:08] yup [12:28:09] elukey - sorry, previously it was 0.1% [12:28:12] yeah all good, the main concern was the mysql db [12:28:13] now it's 10% [12:28:23] ah ok :) [12:28:33] please do not increase it more before contacting us [12:29:13] elukey, roger that [12:30:07] elukey: the spiky aspect of the curve is not nice for the last 1h or so :( [12:31:09] elukey: can it be due to backpressure from the all-handling-topic not ingesting fast enough?
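elukey's point above — within one consumer group a partition is assigned to at most one consumer, so members beyond the partition count sit idle as hot standbys — can be illustrated with a small assignment sketch. This is plain Python with no Kafka client, using round-robin as a stand-in for Kafka's group rebalancing; the 12 matches the eventlogging-valid-mixed partition count mentioned above, while the worker names and the group size of 14 are made up:

```python
def assign(partitions, consumers):
    """Round-robin partition assignment: each partition goes to exactly
    one consumer, so consumers beyond the partition count get nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(12))                   # 12 partitions in the topic
group = ["worker-%d" % i for i in range(14)]   # 14 consumers join the group
plan = assign(partitions, group)
standby = [c for c, parts in plan.items() if not parts]
print(standby)  # the consumers past the partition count sit idle
```

This is why "add workers" only helps up to the partition count, and why the discussion turns to raising the number of partitions instead.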
[12:33:04] yeah it is a bit weird [12:33:28] let's see if it stabilizes, Andrew should be online in a few so we can discuss how to proceed [12:34:06] elukey, joal if you need I can revert the change - /cc HaeB [12:34:25] elukey: another not so nice thing - camus has not yet caught up on webrequest-text - I hope it'll be done soon, but as of now, not yet [12:35:00] raynor, elukey: I also think it's better to wait for andrew and discuss when he'll be there [12:35:08] joal: well it may take a bit to recover all the data in 3/4 hours, let's re-check in a bit [12:35:16] ok [12:35:25] I'll be here, ping me when necessary [12:35:32] Thanks raynor [12:35:55] thanks! [12:37:08] !log Rerun webrequest-load-wf-text-2018-9-25-6 and webrequest-load-wf-text-2018-9-25-7 after SLA failure due to hadoop master swaps [12:37:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:37:16] elukey: data is present for those 2 [12:37:19] so rerun [12:37:25] ack! [12:37:29] Let's wait and see how it behaves over the next hours [12:40:30] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5351 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:41:09] PROBLEM - cache_text: Varnishkafka Webrequest Delivery Errors per second on einsteinium is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:41:27] ouch [12:41:28] pff [12:42:02] elukey: you tell me how I can help?
it was a spike that recovered [12:43:05] but it may be Kafka under distress [12:43:38] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=36&fullscreen&orgId=1&var-instance=webrequest&var-host=All&from=now-3h&to=now [12:43:40] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5376 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:43:55] so ~12:07 UTC [12:43:57] elukey: That's what I was trying to check [12:44:16] let's see the deployment for el [12:44:19] RECOVERY - cache_text: Varnishkafka Webrequest Delivery Errors per second on einsteinium is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:45:01] hm elukey - might be related to camus reading a lot as well [12:45:04] yeah [12:45:12] once it has caught up, maybe it'll be better? [12:45:50] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5367 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:46:07] from the metrics it seems that the worst part is recovered [12:46:16] I think it was camus [12:46:25] elukey: very probable - similar issue to the one we once had with Andrew when upgrading Kafka - When camus is late, the data it tries to read is not present in cached memory anymore, leading to a lot of pressure on disks [12:46:42] could be yes [12:46:43] not yet done [12:46:54] elukey: https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=30&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-6h&to=now [12:48:14] joal: I think it may also be
https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=31&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-6h&to=now [12:48:35] elukey: not sure about what it means, but a big spike in nf_conns_track [12:48:54] those are network bytes out [12:49:08] elukey: for me reads-on-disk come from requests to serve bytes out for stuff that is not anymore in page-cache [12:49:11] the consumers that we run are somehow greedy in my opinion [12:49:21] elukey: :) [12:49:35] elukey: camus greedy, asking for oldish stuff = kafka under pressure [12:49:38] sure, but a consumer should in theory have a pace that doesn't brutalize kafka [12:50:13] well, webrequest topic has a large number of partitions with a not-so-large number of kafka machines [12:50:19] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5417 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:50:26] And We ask camus to have one consumer per partition [12:51:05] So it means ~12 camus consumers per kafka machine [12:51:23] (IIRC webrequest_text topic has 72 partitions) [12:52:26] nono it has 24 [12:52:35] The shape of those curves is going in the right direction IMO - Hopefully it'll calm down [12:53:24] I still think that this shouldn't happen, even if page cache is not warmed up [12:53:36] camus is too greedy afaics [12:53:39] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5520 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [12:53:48] uff this alarm is noisy [12:53:54] downtiming [12:54:29] Ah - my bad on the partition side elukey - I however confirm camus has 72 workers (I knew this number was coming from somewhere [12:54:45] We should downsize this to 48 (misc is gone) [12:56:21] 72???
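One generic way to give a catch-up consumer "a pace that doesn't brutalize kafka" is a token-bucket throttle around its fetch loop. This is a sketch only, not a real Camus feature (Camus and Kafka expose their own fetch-size and quota knobs), and the rates in the usage comment are made up for illustration:

```python
import time

class TokenBucket:
    """Cap average throughput at `rate` units/sec, allowing bursts up to
    `capacity`. A single consume() must not exceed `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def consume(self, amount):
        """Block until `amount` tokens have accumulated, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= amount:
                self.tokens -= amount
                return
            time.sleep((amount - self.tokens) / self.rate)

# Hypothetical usage: limit a backfilling consumer to ~50 MB/s no matter
# how far behind it is, so catch-up reads don't hammer the brokers' disks:
# bucket = TokenBucket(rate=50 * 1024 * 1024, capacity=100 * 1024 * 1024)
# for batch in consumer:
#     bucket.consume(len(batch))
#     process(batch)
```

The design choice is that steady-state traffic is unaffected (the bucket stays full), while a consumer that falls behind is smoothed to the configured rate instead of reading as fast as the brokers can serve.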
[12:56:26] /o\ [12:56:30] OMG I finally have the power [12:56:45] * joal hides behind elukey [12:57:07] joal: it should use only 48 though [12:57:16] elukey: I think it does so [12:57:28] but 48 at full power is maybe a bit too much [12:57:29] elukey: or I should say: I hope it does so [12:58:00] I am 99% sure [12:58:03] elukey: On a normal basis it works great - We should have procedures (maybe config change) for backfilling in a gentler way [12:58:27] or some sort of rate-limit [12:58:39] elukey: I think we should be able to configure that [12:59:14] that is also what's happening in eventlogging [12:59:23] same thing [13:00:01] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Create reports in wikistats UI for "most prolific editors" (a.k.a "top contributors") - https://phabricator.wikimedia.org/T189882 (10fdans) [13:00:03] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Create report for "articles with most contributors" in Wikistats2 - https://phabricator.wikimedia.org/T204965 (10fdans) 05duplicate>03Open [13:00:09] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5507 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:00:26] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Create report for "articles with most contributors" in Wikistats2 - https://phabricator.wikimedia.org/T204965 (10fdans) Apologies, merged this by mistake [13:00:35] ah right the alert is on einsteinium [13:00:46] o/ [13:00:55] Heya [13:01:25] OH DID YOU GUYS ALREADY DO IT?!?! [13:02:02] yep! [13:02:13] OHHHH I THOUGHT IT WAS 10am MY TIME for some crazy reason [13:02:14] amazing! [13:02:17] how'd it go? [13:02:22] all good! [13:02:22] we haz new masterz [13:02:26] wow [13:02:27] nice job yall [13:02:51] all smooth as butter?
[13:02:57] jobs are gently catching up, but kafka is under pressure [13:03:18] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5535 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:03:20] yeah camus was a bit brutal as far as we saw [13:03:29] plus we have a new el event [13:03:39] oh hm [13:03:39] err not new but with increased sampling [13:03:59] so to summarize [13:04:35] 1) we re-enabled camus etc.. after ~3.30h of downtime (more or less) at around 12 and something UTC [13:05:04] 2) just a bit after raynor deployed a change to increase the readingdepth schema sampling [13:05:18] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5511 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:06:05] interesting [13:06:08] Varnishkafka for webrequest text complained a bit about dropped events (a single spike), seems related to camus catching up. [13:06:15] assuming things are just backed up tho? [13:06:18] oh vk dropped some [13:06:18] ? [13:06:20] yeah [13:06:22] kafka was that overloaded? [13:06:35] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1 [13:07:19] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5511 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:07:26] the network bytes out went up a lot, and as joseph was saying it might also have affected page cache [13:07:34] (not warmed up and hence hitting the disk) [13:07:54] Eventlogging seems fine for the moment, but the throughput graph is very spiky [13:09:35] elukey: CitationUsagePageLoad seems stable now - I feel better :) [13:09:36] that makes sense, but still strange that that would cause vk message loss, especially mostly from esams/ asia (what's the asia dc name?) [13:10:00] eqsin [13:10:03] spikey el i think makes sense ya? and is ok? 
maybe we should throttle camus [13:10:10] somehow [13:10:19] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5514 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:10:31] yep.. not sure how to do it [13:10:51] in the meantime, I'd try to silence this alert :D [13:11:43] still tho, i don't like the fact that camus (or any consumer) can cause this to happen... [13:12:57] same thing, it may have been something else, didn't check yet, but timing is really strange [13:13:05] thought of the moment on EL: looking at VirtualPageviews and ReadingDepth event throughput makes me feel that those events are actually generated jointly at the functional level (meaning on the same user action) - Shouldn't we think about ways of not having them separated? [13:13:15] ottomata: see https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=36&fullscreen&orgId=1&var-instance=webrequest&var-host=All&from=now-3h&to=now [13:13:26] I re-enabled camus more or less at 12:something UTC [13:14:05] joal: dunno, that might be true, but we will probably encourage people to make more individual events, not fewer [13:14:09] at least in the future [13:14:28] huh, only 2 brokers!
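[editor's note] One broker-side answer to the "throttle camus... not sure how to do it" question above is Kafka's client quota mechanism (available since Kafka 0.9): the broker delays fetch responses for a given client.id instead of letting it saturate disk and network. A sketch only — it assumes the camus consumers share a client.id of `camus`, and the ZooKeeper connect string is illustrative, not the real jumbo-eqiad one.

```shell
# Limit any client identifying itself as "camus" to ~50 MB/s of
# fetched bytes per broker. The broker enforces this by throttling
# fetch responses, not by erroring out.
kafka-configs.sh --zookeeper zk-host:2181/kafka/jumbo-eqiad \
  --alter --add-config 'consumer_byte_rate=52428800' \
  --entity-type clients --entity-name camus
```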
[13:14:29] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5591 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:15:06] ottomata: While I understand, it might be wise to try to bundle functionally-collated events (the overhead price per event is not neglectable) [13:15:58] aye, dunno how long these events are, but we got that url char limit (for now) [13:16:09] but, even so, i want the system to scale way beyond what it is doing now [13:16:33] ottomata: I'm not afraid it wouldn't :) [13:16:38] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5642 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:17:25] ottomata: raising the --^ to 8000 ok? [13:17:55] +1 [13:17:59] elukey: raise to 10K [13:18:00] hmmm [13:18:02] no 8K is good [13:18:06] i guess we want to know [13:18:16] (03CR) 10Fdans: "@Nuria I can't replicate the table view issue you're describing. Otherwise I'll push now a link for the metric." (039 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/458784 (https://phabricator.wikimedia.org/T203180) (owner: 10Fdans) [13:18:34] too bad we don't have the anomaly based error anymore [13:18:39] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5649 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:20:02] we might! I haven't checked if Filippo made some progress on this side [13:20:20] ottomata: because it was in graphite and not in prometheus, right? 
right [13:20:58] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5595 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:21:21] running puppet on einsteinium --^ [13:22:59] PROBLEM - Throughput of EventLogging events on einsteinium is CRITICAL: 5608 ge 5000 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:23:00] just checking, everything ok? do you need a patch to decrease the sampling rate for ReadingDepth? [13:24:08] If you need I can just create a patch to decrease that to 1% (previously it was 0.1%, we increased that to 10%), ReadingDepth events should drop [13:24:49] raynor: it should be ok, the system is just catching up now, and the problem wasn't really caused by your increase (right elukey?) [13:25:11] seems so yes, but we wanted your opinion for eventlogging [13:25:14] seems that we are good now [13:27:20] elukey: looking at graphs in more detail, and I really wonder if the spikiness we see in EL throughput is related to camus or not [13:27:50] elukey: camus read starts at 11:30 UTC (from both bytes out and disk-readIOPS) [13:28:08] should be later, ~12:something [13:29:30] raynor: sorry qq - do you have a timing of your deployment in UTC? [13:29:33] elukey: charts don't agree :) [13:29:42] what charts?
[13:29:46] (03PS8) 10Fdans: Add pages to date metric [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/458784 (https://phabricator.wikimedia.org/T203180) [13:29:54] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-3h&to=now&panelId=30&fullscreen [13:30:10] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-3h&to=now&panelId=19&fullscreen [13:30:19] 10Analytics, 10Analytics-EventLogging: Increase number of partitions of eventlogging-client-side topic in Kafka jumbo-eqiad - https://phabricator.wikimedia.org/T205436 (10Ottomata) [13:30:21] just made ^ so I don't forget [13:30:46] joal: I then logged in the sal ~30 mins afterwrds [13:31:10] hm [13:31:24] elukey: "11:20 pmiazga@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: Increase sampling ratio for ReadingDepth (T205176) (duration: 00m 50s)" [13:31:25] T205176: Increase default sampling ratio of ReadingDepth - https://phabricator.wikimedia.org/T205176 [13:31:27] I can't think of something else having bumped kafka bytes out that way though [13:32:39] let's see camus log [13:34:17] 10Analytics, 10Analytics-EventLogging: Resurrect eventlogging_EventError logging to in logstash - https://phabricator.wikimedia.org/T205437 (10Ottomata) [13:34:57] joal: I have a log at 2018-09-25T06:23:14 and the next one at 18/09/25 11:30:07, so 11:30 indeed [13:35:02] my log was delayed [13:35:25] luckily the el deployment started 10m before that [13:35:37] but I am pretty sure it didn't matter that much [13:36:04] (or maybe a combination of the two things happening at once destabilized Kafka) [13:36:51] the interesting part from [13:36:52] 
https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-3h&to=now&panelId=31&fullscreen [13:37:08] is that at around 12:something some brokers started to stream more data [13:37:12] it seems unlikely to me that an extra 800 events per second on kafka jumbo would cause this [13:37:18] how long was camus stopped? [13:37:21] 3 hours? [13:37:24] yeah a bit more [13:37:37] aye, that makes sense, that would cause a lot of disk io [13:37:44] on kafka brokers [13:38:27] but even that doesn't explain the whole thing, because if the consumer's pace was "gentle" enough we wouldn't have seen any big issues [13:38:50] probably 48 consumers all at once asking for data were a bit too much [13:38:52] camus is also limited to 9 minute run [13:39:00] it will consume for 9 minutes, then stop [13:39:05] in 9 minutes a lot of things can be streamed :D [13:39:06] and then cron(systemd timer??) will start it again [13:39:13] which might explain any spikeyness [13:39:47] (or maybe not :p) [13:40:58] * elukey sings "maybeeeee tomorrowwww I'll find my wayyyy" [13:41:44] well luca since things are looking mostly ok.... REVIEW MY PATCH CMMOOONNNN https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460417/ [13:41:45] :p [13:42:42] ottomata: can I do it after standup? Need to help Joe with one thing [13:42:51] I promise I'll do it [13:43:04] I meant to do it after hadoop but kafka interrupted me :P [13:43:25] yes yesyes :) [13:43:50] elukey: actually i might move on it before then, but if I do postreview i guess is fine. but maybe i won't get to it! 
[13:43:55] joal: VirtualPageviews and ReadingDepth don't share events generated at the same user actions (the former comes from page previews hovers, the latter from pageviews) [13:45:33] ottomata: ahahhaha [13:45:56] HaeB: I think they are somewhow related (whether readingdepth is also available on preview or something else), cause they follow very similar shapes in term of events throuput (more previews than RedingDepth, but same pattern) [13:48:17] yes, it makes sense that page previews are roughly proportional to pageviews over time, but the underlying events for the two schemas are still generated by very different user actions [13:50:32] ottomata, how are things going with exported resources in beta? [13:55:33] Krenair: not working yet, trying to figure out why [13:55:37] no errors [13:55:46] but no resources were found during the puppet run [13:56:09] where is the resource being exported and where is it supposed to be imported? [13:56:27] exported from e.g. deployment-kafka-jumbo-1 [13:56:34] imported on deployent-prometheus01 [13:56:41] and the manifests that do this? 
[13:59:36] jmx_exporter declared on kafka-jumbo-1 here: [13:59:37] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/kafka/broker/monitoring.pp#L33 [13:59:57] which itself declares jmx_exporter_instance [13:59:58] https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/prometheus/jmx_exporter.pp#L96 [14:00:01] then [14:00:31] role::prometheus::beta declares jmx_exporter_config: [14:00:32] https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/prometheus/beta.pp#L92 [14:00:53] which does query_resources: [14:00:54] https://github.com/wikimedia/puppet/blob/production/modules/prometheus/manifests/jmx_exporter_config.pp#L35 [14:01:08] for nodes that declared Prometheus::Jmx_exporter_instance [14:04:49] deployment-kafka-jumbo-1 has role::kafka::jumbo::broker [14:04:54] ya [14:05:13] which includes profile::kafka::broker [14:05:30] ya [14:05:56] if profile::kafka::broker::monitoring_enabled is set to true it'll include profile::kafka::broker::monitoring [14:06:07] ya [14:06:14] and it looks like profile::kafka::broker::monitoring_enabled: true [14:06:16] ok [14:06:34] ya, and the kafka process there is configured to use jmx exporter, so we know that it is applying there properly [14:06:41] and i can curl jmx metrics from it [14:06:53] so, now we just need prometheus configured to scrape it [14:07:14] $::site is still eqiad here, right? [14:07:23] that shouldn't matter [14:07:42] which i could run query_resources myself to test things [14:07:53] I'm not seeing any exporting here? 
[14:08:20] profile::prometheus::jmx_exporter includes prometheus::jmx_exporter_instance but doesn't export [14:08:31] prometheus::jmx_exporter_instance is an empty definition [14:08:50] yes [14:09:15] the jmx_exporter_instance is selected via query_resources function in jmx_exporter_config [14:09:22] in this case [14:09:29] class_name => 'profile::kafka::broker::monitoring', [14:09:29] instance_selector => 'kafka_broker_.*', [14:09:39] $resources = query_resources( [14:09:39] "Class[\"${class_name}\"]", [14:09:39] "Prometheus::Jmx_exporter_instance[~\"${instance_selector}\"]", [14:09:49] so it is looking for all nodes that have the class profile::kafka::broker::monitoring declared [14:10:24] is that something you can do? [14:10:27] and selecting from those all Jmx_exporter_instance named like 'kafka_broker_.*' [14:10:29] i guess so! [14:10:51] I've seen stuff getting exported like @@sshkey in ssh::server [14:10:54] https://github.com/dalen/puppet-puppetdbquery/blob/master/lib/puppet/parser/functions/query_resources.rb [14:10:55] yeah [14:11:08] which is then brought in with query_resources in modules/ssh/templates/known_hosts.erb [14:11:12] and that works in beta [14:11:31] interesting, yeah ok then. if query_resources works in beta for that, it should work for this [14:11:34] it's the @@ that makes it exported [14:11:41] have you tried this in prod? [14:11:44] yes [14:11:49] this is filippos stuff [14:11:53] and it's exporting everything even without @@ ? [14:11:53] (we could move to -operations) [14:12:10] ya [14:12:51] i *think* the @@ part just makes it availble to realize in puppet without the query_resources stuff [14:13:31] so you can collect it with the spaceship operator [14:13:47] right [14:19:40] joal: btw, what do you think of https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/439917/ ? 
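[editor's note] For reference, the distinction being worked out above: `@@` marks a Puppet resource as exported (stored in PuppetDB, not applied locally), the `<<| |>>` "spaceship" collector realizes it on another node, while puppetdbquery's `query_resources()` reads PuppetDB directly — which is why it can find resources that were never exported. A minimal sketch (`$host_key` is a placeholder variable, not from the real manifests):

```puppet
# On each kafka host: export the host key. Nothing is applied here;
# the resource is only written to PuppetDB.
@@sshkey { $::fqdn:
  type => 'ssh-rsa',
  key  => $host_key,
}

# On a collecting node: realize every exported Sshkey from PuppetDB.
Sshkey <<| |>>

# Alternatively, query_resources() matches any catalogued resource
# without the export/collect dance, as jmx_exporter_config.pp does:
#   query_resources('Class["profile::kafka::broker::monitoring"]',
#                   'Prometheus::Jmx_exporter_instance[~"kafka_broker_.*"]')
```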
I am wondering about the instance selector [14:20:00] Prometheus::Jmx_exporter_instance[~\"${instance_selector}\"] [14:20:22] modules/profile/manifests/prometheus/jmx_exporter.pp calls the prometheus::jmx_exporter_instance $title [14:21:35] okay so modules/profile/manifests/kafka/broker/monitoring.pp will name it "kafka_broker_${::hostname}" [14:22:05] so if it's happy with matching Prometheus::Jmx_exporter_instance[~"kafka_broker_.*"] ... [14:22:17] right [14:22:22] and it's okay with that not being an exported resource... [14:24:50] joal,ottomata: how do you feel to nuke analytics100[1,2] ? [14:26:20] krenair@deployment-cumin:~$ sudo cumin "P{R:Class = profile::kafka::broker::monitoring}" [14:26:27] 4 hosts will be targeted: [14:26:27] deployment-kafka-jumbo-[1-2].deployment-prep.eqiad.wmflabs,deployment-kafka-main-[1-2].deployment-prep.eqiad.wmflabs [14:27:43] elukey: +1 [14:28:11] Krenair: cool that is correct [14:29:27] interestingly, nothing for profile::prometheus::jmx_exporter [14:30:16] oh but R:Profile::Prometheus::Jmx_exporter has stuff [14:31:15] krenair@deployment-cumin:~$ sudo cumin "P{R:Profile::Prometheus::Jmx_exporter ~ kafka_broker.*}" [14:31:18] 4 hosts will be targeted: [14:31:18] deployment-kafka-jumbo-[1-2].deployment-prep.eqiad.wmflabs,deployment-kafka-main-[1-2].deployment-prep.eqiad.wmflabs [14:31:45] er, same for kafka_broker_.* [14:32:57] krenair@deployment-cumin:~$ sudo cumin "P{R:Prometheus::Jmx_exporter_instance = kafka_broker_deployment-kafka-jumbo-1}" [14:33:01] 1 hosts will be targeted: [14:33:01] deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs [14:33:46] same for R:Prometheus::Jmx_exporter_instance ~ kafka_broker_.* [14:34:34] 10Analytics: 'group' parameter in Reportupdater for automatic chgrp of generated reports - https://phabricator.wikimedia.org/T205441 (10mpopov) p:05Triage>03Normal [14:34:44] so far so good [14:34:46] :) [14:35:46] `R:Prometheus::Jmx_exporter_instance ~ kafka_broker_.*` gives you
deployment-kafka-jumbo-[1-2].deployment-prep.eqiad.wmflabs,deployment-kafka-main-[1-2].deployment-prep.eqiad.wmflabs [14:36:07] `R:Class = profile::kafka::broker::monitoring` gives you deployment-kafka-jumbo-[1-2].deployment-prep.eqiad.wmflabs,deployment-kafka-main-[1-2].deployment-prep.eqiad.wmflabs [14:36:11] but [14:36:24] `(R:Class = profile::kafka::broker::monitoring) and (R:Prometheus::Jmx_exporter_instance ~ kafka_broker_.*)` gives you nothing [14:36:29] I wonder if my syntax is broken [14:42:20] hm [14:59:54] ottomata, elukey - sorry was gone for kids [15:00:15] ottomata: I know this schema is one of the ones we discussed - It's nonetheless not one I like :) [15:01:59] (03PS4) 10Fdans: Add most top editors and top edited pages metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/461635 (https://phabricator.wikimedia.org/T189882) [15:02:08] ping fdans [15:02:08] fdans: standuppp [15:08:18] sorriiiii [15:09:39] (03CR) 10jerkins-bot: [V: 04-1] Add most top editors and top edited pages metrics [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/461635 (https://phabricator.wikimedia.org/T189882) (owner: 10Fdans) [15:11:04] (03PS1) 10Bearloga: [WIP] Add support for group param for chgrp-ing generated reports [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/462732 (https://phabricator.wikimedia.org/T205441) [15:14:09] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add support for group param for chgrp-ing generated reports [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/462732 (https://phabricator.wikimedia.org/T205441) (owner: 10Bearloga) [15:15:46] 10Analytics, 10Analytics-Wikimetrics, 10Security-Reviews: security review of Wikimetrics {dove} - https://phabricator.wikimedia.org/T76782 (10sbassett) Got a response email from Dan/milimetric, tracking here: >> 1) The tool is still actively used (metrics.wmflabs.org) - seems to be, just wanted to confirm.... 
[15:19:56] 10Analytics, 10Discovery-Analysis, 10Product-Analytics, 10Wikidata, and 2 others: Query stats dashboard not updating - https://phabricator.wikimedia.org/T204415 (10mpopov) >>! In T204415#4612751, @Ottomata wrote: > Ok, I've added the analytics-search system user to the analytics-search-users group. You sho... [15:29:17] 10Analytics, 10Patch-For-Review: 'group' parameter in Reportupdater for automatic chgrp of generated reports - https://phabricator.wikimedia.org/T205441 (10mpopov) [15:31:51] mforns: hey, let me know if you wanna chat about T205441 (T204415#4615408 for context) and https://gerrit.wikimedia.org/r/#/c/analytics/reportupdater/+/462732/ [15:31:52] T204415: Query stats dashboard not updating - https://phabricator.wikimedia.org/T204415 [15:31:52] T205441: 'group' parameter in Reportupdater for automatic chgrp of generated reports - https://phabricator.wikimedia.org/T205441 [15:32:20] bearloga, yes, sure, I'm in a meeting, will ping you later on today, is that ok? [15:32:30] mforns: yup! [15:32:34] thanks! [15:32:41] ok :] [15:57:11] ottomata and team: groceryheist got analytics access yesterday (https://phabricator.wikimedia.org/T204790 ), but can't log into SWAP - perhaps his LDAP credentials haven't been created yet? [15:57:46] hi [15:59:38] o/ [15:59:48] groceryheist: can you try to log in now? So I can check logs [16:00:41] (also I need to know if you are on notebook100[3,4] [16:00:45] dpme [16:00:46] o [16:00:47] done [16:00:50] i'm on 1004 [16:01:29] it says invalid password for user etc.. [16:01:54] ok so my Username should be nathante because that is my shell name right [16:02:01] and my passwd should be my wikitech password [16:02:09] I just changed that passwd [16:02:15] ahhh [16:02:21] is that right? [16:02:47] sorry I didn't follow.. you changed your pass before trying to log in? [16:03:08] are you able to log in in wikitech? 
[16:03:38] i'm logged into wikitech [16:04:00] ahhh no no [16:04:02] I got it [16:04:08] ah [16:04:16] you are not in nda or wmf LDAP groups [16:04:20] this is why [16:05:07] * groceryheist nods [16:05:15] should that have been created as part of https://phabricator.wikimedia.org/T204790 ? [16:06:01] Hey HaeB [16:06:13] (if not, we may want to clarify https://wikitech.wikimedia.org/wiki/SWAP#Access ) [16:06:30] it depends, if the user belongs to a wmf employee usually is it added straight away [16:06:43] otherwise the user needs to request access to the 'nda' ldap group [16:07:05] i see [16:07:06] we can definitely clarify the page [16:07:17] 10Analytics, 10Analytics-Wikistats: Wikispecial wikis WikiStats Zeitgeist should include talk - https://phabricator.wikimedia.org/T37195 (10ezachte) @Aklapper I don't remember this bug at all.. As I'm wrapping up., this won't be for me. Will unhook. Thanks [16:07:36] usually it is a matter of opening a task to https://phabricator.wikimedia.org/project/view/1564/ [16:07:41] 10Analytics, 10Analytics-Wikistats: Wikispecial wikis WikiStats Zeitgeist should include talk - https://phabricator.wikimedia.org/T37195 (10ezachte) a:05ezachte>03None [16:07:45] requesting the access to the group [16:07:53] (NDA needs to be checked etc..) [16:09:49] elukey: should I make a new task or re-open this one: https://phabricator.wikimedia.org/T204790 [16:09:52] ? [16:10:04] 10Analytics, 10Analytics-Wikistats: WikiStats should recognize global bots - https://phabricator.wikimedia.org/T37196 (10ezachte) a:05ezachte>03None [16:11:55] 10Analytics, 10Analytics-Wikistats: Add a "number of articles with interwiki links" column in WikiStats tables - https://phabricator.wikimedia.org/T37197 (10ezachte) a:05ezachte>03None No longer assignee. [16:13:50] groceryheist: better to open a new one for the https://phabricator.wikimedia.org/project/view/1564/ project [16:19:41] elukey: ok https://phabricator.wikimedia.org/T205454 [16:24:11] elukey: like this? 
https://wikitech.wikimedia.org/w/index.php?title=SWAP&diff=1804118&oldid=1800640 [16:24:17] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (10Ottomata) [16:24:36] HaeB: +1 [16:24:39] groceryheist: ack! [16:33:08] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/462761/ [16:34:52] merged [16:36:42] 10Analytics: Why do dumps and pageview api have slightly different counts? - https://phabricator.wikimedia.org/T205457 (10Milimetric) [16:37:22] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Remove webrequest misc analytics related jobs and code after cache misc -> text merge is complete - https://phabricator.wikimedia.org/T200822 (10elukey) [16:43:10] analytics100[1,2] have role spare now [16:43:14] cc ottomata --^ [16:43:28] I also tried to remove all the hdfs-related crons from these [16:43:55] tomorrow I am going to open a task to decom them [16:44:55] * elukey off! [16:47:38] great! [16:47:43] 10Analytics, 10Analytics-Kanban: Remove sessionId, pageId pairs from whitelist - https://phabricator.wikimedia.org/T205458 (10Nuria) [16:48:01] resending, I think this didn't go through: anybody here who can talk about the Hive query failure I reported on the mailing list? nuria ? [16:48:30] NeilPatelQuinn[m: i can for a bit, sure [16:48:34] how are you running in notebook? [16:48:36] i want to repro [16:48:42] NeilPatelQuinn[m: on meeting but see my response, can you try to run query on stat1005, a copy of it is on /home/nuria/workplace/tmp [16:48:51] well, first I should say, I get the same thing on the command line [16:48:54] nuria: his problem is with hue/notebook, it should work there [16:48:55] OH [16:48:56] you do! [16:48:59] now that is interesting [16:49:00] ok trying [16:49:00] output is https://phabricator.wikimedia.org/P7590 [16:49:07] yeah, actually! [16:49:45] NeilPatelQuinn[m: did you tried hive instead of beeline? 
[16:49:55] NeilPatelQuinn[m: like hive -f query> out.txt [16:50:08] No, I thought beeline is the recommended client? [16:50:09] Will try [16:50:43] NeilPatelQuinn[m: beeline is newer and should be better...but sometimes it isn't :/ [16:50:48] yeah it works for me with hive [16:50:50] strange that beeline doesn't work [16:50:56] NeilPatelQuinn[m: beeline has a harder time reporting pertinent errors [16:50:58] but, i think beeline works more like the other clients do [16:53:14] mforns: the "- Fix hover box" bug on wikistats.. do you have a ticket number? [16:55:54] NeilPatelQuinn[m: i don't know why but i think https://stackoverflow.com/questions/46439306/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-ma/50578689 [16:55:56] works [16:56:00] SET hive.auto.convert.join=false; [16:57:27] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-PriorSupportforMAPJOIN [16:58:38] cc milimetric: the "fix hover box" bug on wikistats.. do you have a ticket number? we have it on our small list for goals and maybe fdans can work on it [16:58:54] looking [17:00:02] Hi! Can I ask a question about reportupdater? For one of our report (https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/modules/metrics/search/invoke_source_counts.sql), there is this error message in the log: ```pymysql can not execute query ((2006, "MySQL server has gone away (error(32, 'Broken pipe'))")).``` [17:00:36] I didn't see this problem in other reports... Has anyone seen this before? 
[17:01:21] 10Analytics: Make hover info-box on bar charts consistent with line charts - https://phabricator.wikimedia.org/T205461 (10Milimetric) [17:01:49] 10Analytics: Make hover info-box on bar charts consistent with line charts - https://phabricator.wikimedia.org/T205461 (10Milimetric) p:05Triage>03High [17:02:08] nuria / fdans: I couldn't find one so I made it ^ https://phabricator.wikimedia.org/T205461 [17:04:41] chelsyx: that happens from time to time, should be ok because RU will just rerun the query until it gets an answer [17:04:58] it looks like the data for that metric is up to date as of Sept. 24th, should be ok [17:05:13] chelsyx: basically mysql can reboot or connections can time out, etc. [17:05:31] chelsyx: if it happens consistently, and you see data stagnate for a few days, let us know [17:07:52] milimetric: Ah yes, it's back now. Thanks! [17:12:34] milimetric: thank youuuu [17:15:34] if i want to install some emacs packages on a stat machine do i need to go through puppet? [17:24:22] 10Quarry: Allow quarry queries to be executed by someone else without the need to fork - https://phabricator.wikimedia.org/T203791 (10zhuyifei1999) [17:30:10] fdans: given that we agreed per goals to just make sure we do not install any python2 software this quarter let's move to https://phabricator.wikimedia.org/T205461 and put the python3 work on back burner [17:32:43] nuria: cooooool, working on that now, is there a task for druid indexation? 
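[editor's note] The reportupdater behaviour described above ("should be ok because RU will just rerun the query until it gets an answer") amounts to a retry loop around the report query. A minimal illustration, not reportupdater's actual code; `OSError` stands in for pymysql's `OperationalError`.

```python
import time

def run_with_retry(query_fn, retries=3, delay=0.01):
    """Re-run query_fn() on transient connection errors.

    Mirrors the idea that a "MySQL server has gone away" failure
    self-heals on a later run; OSError stands in here for
    pymysql.err.OperationalError.
    """
    last_exc = None
    for _ in range(retries):
        try:
            return query_fn()
        except OSError as exc:   # e.g. (2006, "MySQL server has gone away")
            last_exc = exc
            time.sleep(delay)    # back off before the next attempt
    raise last_exc

# Simulated report query: fails twice with a broken pipe, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(32, "Broken pipe")
    return [("invoke_source_counts", 1234)]

result = run_with_retry(flaky_query)
```

In the real setup the "retry" is simply the next scheduled run of the report, so stale data for a day or two is the expected symptom rather than a hard failure.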
[17:36:17] !log stopping refine jobs and deploying refinery source 0.0.75 - T203804 [17:36:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:36:24] T203804: Refactor Refine job scalaopt to use property files and CLI overrides - https://phabricator.wikimedia.org/T203804 [17:44:56] (03PS1) 10Joal: Update top-editors endpoint not to show IPs [analytics/aqs] - 10https://gerrit.wikimedia.org/r/462778 (https://phabricator.wikimedia.org/T204707) [17:45:53] ottomata, NeilPatelQuinn[m : The error from the join was coming from a map-join error? [17:46:18] Also NeilPatelQuinn[m - Have you tried to use the user_history table? I could help with that if you want [17:46:53] nuria, milimetric sorry missed ping [17:53:13] joal: afaict, yes [17:53:17] turning off map join fixes it [17:53:27] hm - Super bizarre [17:53:42] Problem also occurs in beeline from what I read [17:54:42] but, it seems to affect clients that use jdbc rather than hive-server? or hive metastore? [17:54:44] not sure which [17:54:52] yes, beeline, hue and notebook [17:54:59] which i guess all use the same interface (jdbc?) [17:55:28] ottomata: not jdbc, thrift-server [17:55:42] aye, right [17:55:54] jdbc is in the connection url, got that from there [18:15:06] shoot hm, problem. i didn't test running confighelper in yarn cluster [18:15:10] properites files not present...fixing... [18:15:11] someow... [18:20:48] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Niedzielski) @Ottomata, @fgiunchedi hello! We'... [18:26:36] fdans: yes, there are several, but i think mforns was working on the issue that prevented ingesting this schema: https://phabricator.wikimedia.org/T202751 [18:26:59] mforns: can you add the task that we have about flattening arrays to the ticket? 
[18:27:08] nuria, sure [18:29:22] 10Analytics: [EventLoggingToDruid] Allow ingestion of simple-type arrays by converting them to strings - https://phabricator.wikimedia.org/T201873 (10mforns) [18:29:24] 10Analytics, 10Page-Issue-Warnings, 10Product-Analytics, 10Reading-analysis, 10Readers-Web-Backlog (Tracking): Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10mforns) [18:29:41] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Ottomata) They should be. Something is not wor... [18:34:17] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform: Event Schema Registry - https://phabricator.wikimedia.org/T201063 (10Tbayer) >>! In T201063#4593550, @Ottomata wrote: >> As a data analyst or product manager, I want a canonical place wher... [18:38:53] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Ottomata) BTW, I updated https://wikitech.wikim... [18:39:17] 10Analytics, 10Analytics-Kanban: [EventLoggingToDruid] Allow ingestion of simple-type arrays by converting them to strings - https://phabricator.wikimedia.org/T201873 (10Nuria) [18:47:13] joal: what was it about this way of doing the schema that you didn't like? [18:47:16] the huge merged structs? [18:47:50] the revision-score schema? 
[18:50:22] 10Analytics, 10Pageviews-API, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Large increase on 404s from the Wikipedia IOS app - https://phabricator.wikimedia.org/T203688 (10JMinor) p:05Triage>03Normal [18:51:09] hey ottomata [18:52:06] My dislike is for the huge-merg, that I would have named "explicit-union-type" [19:00:19] 10Analytics-Kanban, 10Patch-For-Review: Refactor Refine job scalaopt to use property files and CLI overrides - https://phabricator.wikimedia.org/T203804 (10Ottomata) Woohoo, did it! Re-running refine jobs is now WAY easier. I updated the docs at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Re... [19:00:33] <3 ottomata --^ :) [19:00:47] :) [19:00:50] joal: don't understand [19:00:54] 'explicit-union-type'? [19:01:47] like a union type in avro for instance, but at the structure level [19:01:59] ohoh with the types? [19:02:02] like [19:02:18] "prediction: { [19:02:18] [19:02:18] } [19:02:20] oops [19:02:31] "prediction": { [19:02:31] "boolean": true [19:02:31] } [19:02:31] ? [19:02:47] "prediction": { [19:02:47]   "string": "cool__category" [19:02:47] } [19:02:48] ? [19:06:42] also the per-model main property (articlequality, damaging ...) [19:07:08] All having almost the same base schema :( [19:09:30] yeah, we might be able to solve that eventually if we use json $refs [19:09:32] but we will see [19:13:41] joal: but i became convinced that we had to have the schema as much as possible [19:13:47] in order to use things like kafka connect [19:13:57] which will create the hive schema (or whatever) from the jsonschema [19:14:13] inference isn't really possible from the data one event at a time [19:14:46] ottomata: I understand the reason for which we go for that, I'd have prefered the more reduced schema [19:16:19] ottomata: I'm sorry to be blunt, but you asked me my opinion :) [19:24:20] hey ywah gimme! [19:24:28] the reduced schema would be which one? [19:24:41] without all the probability names explicit? 
[19:24:52] yes [19:25:04] plus, without the explicit model-object [19:30:07] something around this (wrong format and all, but you'll get the idea): https://gist.github.com/jobar/0178d2af63998b4bbb026f94352df422 [19:31:51] (ee sorry meeting now) [19:31:57] milimetric: ping meeting [19:32:11] finishing up another meeting ottomata be there soon [19:32:14] k [19:42:51] Is there a way to check Hive error logs? [19:43:04] e.g. /var/log/hive/hive-server2.log [20:04:33] groceryheist: logs are distributed across several nodes; an application has an id, and its logs only live a short time. Added some info here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#FAQ [20:05:11] groceryheist: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Search_through_logs [20:12:01] ah thanks nuria [20:21:10] !log Webrequest warnings for upload-2018-09-25-13 were all false positives [20:21:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:22:48] (03CR) 10Nuria: Update top-editors endpoint not to show IPs (032 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/462778 (https://phabricator.wikimedia.org/T204707) (owner: 10Joal) [20:24:48] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Krenair) (for context: `modules/prometheus/mani... [20:25:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Wikistats: add functions you apply to dimensional data such as "accumulate" - https://phabricator.wikimedia.org/T203180 (10Nuria) @fdans I think table view is not a bug, rather we are displaying on table data points for which there was no value and, as you... [20:27:01] joal: did we deploy the patch to aqs?
https://gerrit.wikimedia.org/r/#/c/analytics/aqs/+/462778/ [20:27:33] nuria: you just commented on it, so nope, not deployed :) [20:28:06] nuria: There is a bunch to deploy (project-families for additive metrics, top-reformat, and top-no-IPs when corrected) [20:28:23] nuria: I'll also need a patch for restbase endpoints to match our backend ones [20:28:26] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959 (10Nuria) I see, yes, all our oozie jobs use the check but - at this time- if you have access to druid you can run indexations on private cluster directly. Underst... [20:29:21] nuria: also, thanks for comments - Way better indeed to add an IP parsing lib ... My bad [20:29:47] joal: argh my mistake i meant to ask about the one restricting date ranges for top endpoints, sorry. [20:29:58] nuria: np - not deployed [20:30:00] joal: the so-called top-reformat [20:30:15] joal: ok, makes sense that we are grouping all of them to deploy [20:30:23] nuria: trying to find short-but-expressive names is ottomata's speciality ;) [20:30:30] joal: hahahah [20:30:33] joal: ayayay [20:31:47] nuria, milimetric - Would https://www.npmjs.com/package/ip be good for us? [20:32:10] I'd use only a small bit of it, but it's maintained and has quite a bit of usage [20:32:18] joal: I think we just need this one: https://github.com/sindresorhus/ip-regex#readme [20:32:36] nuria: works for me [20:32:37] joal: i believe it's the one the other one uses for its IP regex, right?
[20:32:43] joal: let me check [20:34:12] joal: ah no, this is yet a different one: https://github.com/indutny/node-ip/blob/master/lib/ip.js#L87 [20:34:26] nuria: I've seen that yes [20:34:43] nuria: ip-regex is really neat and clean - I'm happy to go with that one [20:35:39] joal: either one works i think, sounds good [20:47:07] (03PS2) 10Joal: Update top-editors endpoint not to show IPs [analytics/aqs] - 10https://gerrit.wikimedia.org/r/462778 (https://phabricator.wikimedia.org/T204707) [20:47:25] (03CR) 10Joal: "Changes done, with my apologizes for the wrong addition of local files." (032 comments) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/462778 (https://phabricator.wikimedia.org/T204707) (owner: 10Joal) [20:47:58] Ok team - Gone for tonight - see you tomorrow evening (kids day) [20:48:18] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Krenair) With these puppet changes: ```diff --g... [20:57:51] (03CR) 10Nuria: [V: 032 C: 032] Update top-editors endpoint not to show IPs [analytics/aqs] - 10https://gerrit.wikimedia.org/r/462778 (https://phabricator.wikimedia.org/T204707) (owner: 10Joal) [21:04:08] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Ottomata) @fgiunchedi Alex's change ^ should do...
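The patch under review lives in AQS, a Node.js service, and the chat settles on the `ip-regex` npm package for the check. Purely as an illustration of the idea itself, here is the same filtering logic in Python using the stdlib `ipaddress` module instead of a regex (the function names and the `user_text` field are hypothetical, not AQS's actual code):

```python
import ipaddress

def is_ip(username):
    """True if the username parses as a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(username)
        return True
    except ValueError:
        return False

def hide_ip_editors(rows):
    """Blank out user names that are IP addresses (anonymous editors)."""
    return [
        dict(row, user_text=None) if is_ip(row["user_text"]) else row
        for row in rows
    ]
```

A regex like ip-regex and an address parser catch the same usernames here; the parser variant simply avoids maintaining the (surprisingly hairy) IPv6 pattern by hand, which is the "IP parsing lib" point nuria made in review.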
[21:14:16] 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Operations, 10Patch-For-Review, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Krenair) [21:52:45] (03PS1) 10Ottomata: Remove potentially dangerous Refine Config defaults [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/462820 (https://phabricator.wikimedia.org/T203804) [21:53:27] (03PS2) 10Ottomata: Remove potentially dangerous Refine Config defaults [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/462820 (https://phabricator.wikimedia.org/T203804) [21:57:13] (03CR) 10jerkins-bot: [V: 04-1] Remove potentially dangerous Refine Config defaults [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/462820 (https://phabricator.wikimedia.org/T203804) (owner: 10Ottomata) [22:45:22] Anyone here use hive variables? I'm having a bit of trouble making them work. [22:46:40] groceryheist: hive variables? [22:47:01] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution [22:47:36] groceryheist: you need to pass those as part of the config [22:48:01] ah. and if I want them just locally? [22:49:18] groceryheist: maybe if you can explain what you are trying to do i can help you better? We use variables in hql in scripts but maybe you are trying to do something else? [22:50:05] I'm writing an hql script to make histograms. I want to make the bin width a parameter to an hql script [22:50:19] but it would be convenient if I can test it using a local variable [22:50:50] groceryheist: you can pass variables in this fashion: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics_hourly.hql [22:51:19] groceryheist: hive will do the substitution from your command line arguments [22:51:29] that seems good [22:52:00] can I emulate this if i'm working in beekeeper?
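The substitution nuria points to is Hive's variable substitution: `${bin_width}` in the script gets replaced from the command line, e.g. `hive -f histogram.hql -d bin_width=10` with the hive CLI, or `beeline -f histogram.hql --hivevar bin_width=10` in beeline. The file name `histogram.hql` is invented for this note; the Python sketch below just shows what a FLOOR-based histogram query with a parameterized bin width would compute:

```python
from collections import Counter

def histogram(values, bin_width):
    """Bucket values the way FLOOR(x / ${bin_width}) * ${bin_width} would in HQL,
    e.g. SELECT FLOOR(x / ${bin_width}) * ${bin_width} AS bin, COUNT(*) ... GROUP BY ..."""
    return dict(Counter((v // bin_width) * bin_width for v in values))
```

With `bin_width=10`, values 1 and 2 land in bin 0, 11 and 12 in bin 10, and so on; changing the `-d`/`--hivevar` argument reruns the same script with a different bin width.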
[22:52:02] milimetric: do you have any suggestions as to naming here: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventLogging/+/458864/? [22:52:25] groceryheist: i recommend hive if you want to get more pointed error messages [22:53:05] groceryheist: i think you mean "beeline" right? it is less friendly when it comes to returning errors [22:53:28] ah yeah i mean beeline [22:57:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Update UA parser - https://phabricator.wikimedia.org/T189230 (10Nuria) [22:57:42] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Reading-analysis: Assess impact of ua-parser update on core metrics - https://phabricator.wikimedia.org/T193578 (10Nuria) 05Open>03Resolved [22:58:34] thanks nuria [23:03:36] 10Analytics-Kanban: Private geo wiki data in new analytics stack - https://phabricator.wikimedia.org/T176996 (10Nuria) [23:03:40] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Turn off old geowiki jobs - https://phabricator.wikimedia.org/T190059 (10Nuria) 05Open>03Resolved [23:03:55] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965 (10Nuria) [23:03:58] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add data-quality check on mediawiki-history-reduced before druid indexation - https://phabricator.wikimedia.org/T192483 (10Nuria) 05Open>03Resolved [23:04:11] 10Analytics, 10Analytics-Kanban: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642 (10Nuria) [23:04:15] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage thorium to Debian Stretch - https://phabricator.wikimedia.org/T192641 (10Nuria) 05Open>03Resolved [23:04:26] 10Analytics-Kanban, 10Patch-For-Review: Correct user registration date in mediawiki-history - https://phabricator.wikimedia.org/T202269 (10Nuria) 05Open>03Resolved [23:04:41] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats,
10Patch-For-Review: Create reports in wikistats UI for "most prolific editors" (a.k.a "top contributors") - https://phabricator.wikimedia.org/T189882 (10Nuria) [23:04:43] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add AQS endpoint providing top editors ("most prolific contributors") and top pages (by number of edits, by net-bytes-diff and abs-bytes diff) - https://phabricator.wikimedia.org/T201617 (10Nuria) 05Open>03Resolved [23:04:56] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats 2 Backend: Resiliency, Rollback and Deployment of Data - https://phabricator.wikimedia.org/T177965 (10Nuria) [23:04:58] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Add Mediawiki-History data-quality check stage in oozie using statistics - https://phabricator.wikimedia.org/T192481 (10Nuria) 05Open>03Resolved [23:05:12] 10Analytics, 10Analytics-Kanban: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642 (10Nuria) [23:05:18] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade Archiva (meitnerium) to Debian Stretch - https://phabricator.wikimedia.org/T192639 (10Nuria) 05Open>03Resolved [23:12:54] Is beeline -f my_query.hql > file_name.tsv still the best way to get a table out of hdfs? [23:13:08] i'm still waiting on SWAP [23:29:33] groceryheist: i would use hive [23:30:07] groceryheist: hive -f query.hql > out.txt but you can create a database in hive under your own name and dump data there [23:30:24] groceryheist: that might be the easiest [23:30:44] groceryheist: even with SWAP access, commands to hive will not change much [23:32:52] yes I have made a database under my name, is there a way to move data from there to a stat machine?
[23:35:10] groceryheist: you can run hive -f select.hql > out.txt and that will dump data out, please read data access guidelines as data should not leave stats machines [23:35:22] groceryheist: hive -f select.hql > out.txt [23:36:27] * groceryheist nods, not going to move data out of stats machines [23:36:36] hive -f works [23:39:50] looks like hdfs dfs -text /wmf/data/archive/browser/general/desktop_and_mobile_web-2015-9-27/* should also work [23:44:26] another question: where can I make plots?
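`hive -f select.hql > out.txt` writes its result rows to stdout tab-separated, so a dump like that can be read back on a stat machine with any TSV reader. A minimal Python sketch of loading such a dump (the sample data and column names are invented for the example):

```python
import csv
import io

def read_hive_dump(text, columns):
    """Parse the tab-separated stdout of `hive -f query.hql` into row dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]

# In practice you'd pass open("out.txt") instead of an in-memory string.
```

All values come back as strings; cast counts to int (or hand the file to a dataframe library) as a second step.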