[01:19:32] 10Quarry, 10Data-Services: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host - https://phabricator.wikimedia.org/T176694#3633999 (10bd808) [01:19:49] 10Quarry, 10Data-Services: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host - https://phabricator.wikimedia.org/T176694#3634013 (10bd808) [03:44:33] 10Quarry, 10Data-Services: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host - https://phabricator.wikimedia.org/T176694#3633999 (10zhuyifei1999) I donno is the config.yaml is puppetized somewhere (probably not), but this change should do the switch: ``` $ sed 's/enwiki.labsdb/enwi... [03:45:38] (03CR) 10Nuria: [V: 031 C: 031] Add draft creation config and query (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/379441 (https://phabricator.wikimedia.org/T176375) (owner: 10Nettrom) [03:46:00] (03CR) 10Nuria: [V: 032 C: 032] Add draft creation config and query [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/379441 (https://phabricator.wikimedia.org/T176375) (owner: 10Nettrom) [03:51:08] (03CR) 10Nuria: Add oozie jobs loading druid monthly uniques (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/348052 (https://phabricator.wikimedia.org/T159471) (owner: 10Joal) [04:10:46] 10Quarry, 10Data-Services: Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host - https://phabricator.wikimedia.org/T176694#3634206 (10zhuyifei1999) 05Open>03Resolved a:03zhuyifei1999 LGTM [08:48:24] Hi a-team [08:51:29] 10Analytics-Kanban: Replace references to dbstore1002 by db1047 in reportupdater jobs - https://phabricator.wikimedia.org/T176639#3634730 (10elukey) >>! In T176639#3632819, @mpopov wrote: >>>! In T176639#3632795, @elukey wrote: >> Is there a valid reason why `analytics-slave` shouldn't be used? Are we talking ab... 
[08:58:35] joal: o/ [09:02:28] elukey: Do you mind me self-merging some patches and deploying? [09:03:00] (03CR) 10Joal: "Comments inline." (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/348052 (https://phabricator.wikimedia.org/T159471) (owner: 10Joal) [09:05:13] The two patches I'd like to merge are: https://gerrit.wikimedia.org/r/#/c/379836/ and https://gerrit.wikimedia.org/r/#/c/379835/ [09:07:24] nope [09:07:35] (nope == green light from me) [09:07:43] Okey [09:09:01] elukey: here is my plan: self-merging & deploying refinery-source, update refinery for jar versions and then deploy. Will ask for quick reviews on changelogs etc if you don't mind [09:09:20] +1 [09:09:32] joal: what is the fix for the dt:null issue? [09:09:41] (asking because I didn't follow the resolution) [09:09:54] elukey: https://gerrit.wikimedia.org/r/#/c/380533/ [09:10:17] ah nice [09:12:38] (03CR) 10Joal: [C: 032] "Self merging for deploy." [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/379835 (owner: 10Joal) [09:17:34] (03Merged) 10jenkins-bot: Correct field names in mediawiki-history spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/379835 (owner: 10Joal) [09:18:19] git up [09:18:22] oops [09:21:19] (03PS1) 10Joal: Update changelog.md for v0.0.53 deployment [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/380709 [09:21:42] elukey: --^ when you have a minute [09:23:51] (03CR) 10Elukey: [C: 031] Update changelog.md for v0.0.53 deployment [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/380709 (owner: 10Joal) [09:24:22] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/380709 (owner: 10Joal) [09:33:34] !log Releasing refinery-source v0.0.53 with Jenkins [09:33:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:10:20] (03PS1) 10Addshore: instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380715 (https://phabricator.wikimedia.org/T176577) [10:10:35] (03PS1) 10Addshore: instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380716 (https://phabricator.wikimedia.org/T176577) [10:10:44] (03CR) 10Addshore: [C: 032] instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380716 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore) [10:10:47] (03CR) 10Addshore: [C: 032] instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380715 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore) [10:10:50] (03Merged) 10jenkins-bot: instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380716 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore) [10:10:53] (03Merged) 10jenkins-bot: instanceof.php improved extra logging on sparql result [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/380715 (https://phabricator.wikimedia.org/T176577) (owner: 10Addshore) [10:13:00] (03PS2) 10Joal: Correct field names in mediawiki-history jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/379836 [10:15:44] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/379836 (owner: 10Joal) [10:18:51] (03PS1) 10Joal: Bump jar version to 0.0.53 in oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/380718 (https://phabricator.wikimedia.org/T175707) [10:19:20] 
elukey: --^ if you have a minute [10:23:35] (03CR) 10Elukey: [C: 031] Bump jar version to 0.0.53 in oozie jobs [analytics/refinery] - 10https://gerrit.wikimedia.org/r/380718 (https://phabricator.wikimedia.org/T175707) (owner: 10Joal) [10:23:59] elukey: Thanks :) [10:24:22] (03CR) 10Joal: [V: 032 C: 032] "Self merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/380718 (https://phabricator.wikimedia.org/T175707) (owner: 10Joal) [10:25:04] !log Deploy Refinery with scap [10:25:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:29:07] email sent joal :) [10:29:18] Great elukey, thanks a lot ! [10:29:30] elukey: I hope we'll have a go, and we'll be able to move fast :) [10:32:25] y [10:35:00] !log Deploying refinery onto HDFS [10:35:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:53:52] (03CR) 10GoranSMilovanovic: [C: 032] WDCM Usage Dashboard - Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380652 (owner: 10GoranSMilovanovic) [10:53:57] (03CR) 10GoranSMilovanovic: [V: 032 C: 032] WDCM Usage Dashboard - Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380652 (owner: 10GoranSMilovanovic) [10:53:59] (03Merged) 10jenkins-bot: WDCM Usage Dashboard - Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380652 (owner: 10GoranSMilovanovic) [10:56:54] * elukey lunch! 
[10:57:09] err joal, forgot to ask if everything is ok with the deployment [10:57:30] Everything good elukey - Currently restarting jobs [10:57:42] all right, be back in 30 mins :) [10:58:28] !log Restart webrequest load job after deploy [10:58:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:01:36] !log Restart mediawiki-history-denormalize and mediawiki-history-druid jobs after deploy [11:01:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:36:23] Ok a-team, taking a break, everything looks good after deploy (still a job to be restarted, but waiting for current to finish) [12:19:27] (03Abandoned) 10Addshore: DNM JENKINS JOB VALIDATION [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380306 (owner: 10Addshore) [12:28:46] (03CR) 10Addshore: "check experimental" [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380306 (owner: 10Addshore) [12:29:11] (03CR) 10jenkins-bot: DNM JENKINS JOB VALIDATION [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380306 (owner: 10Addshore) [12:29:41] (03CR) 10Addshore: "check experimental" [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380652 (owner: 10GoranSMilovanovic) [12:32:24] (03CR) 10jenkins-bot: WDCM Usage Dashboard - Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380652 (owner: 10GoranSMilovanovic) [13:16:15] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Use Prometheus for Kafka JMX metrics instead of jmxtrans - https://phabricator.wikimedia.org/T175922#3635207 (10fgiunchedi) >>! In T175922#3630917, @fgiunchedi wrote: > Yeah the idea is to have dedicated Prometheus instances roughl... [13:36:43] elukey: HIIiiiI! [13:44:25] o/ [13:54:17] elukey: heyyyy [13:54:22] got a min to talk about druid profile stuff? [13:54:44] hmm, wanna do ops sync today instead of tomorrow? [13:54:49] tomorrow i'll be at strata [13:56:18] sure! [14:00:27] bc? 
[14:00:46] oh bc is taken elukey [14:00:55] going to bc 2 [14:00:55] https://hangouts.google.com/hangouts/_/wikimedia.org/a-batcave-2 [14:02:18] ottomata: grabbing the headphones [14:16:18] 10Analytics-Kanban: Replace references to dbstore1002 by db1047 in reportupdater jobs - https://phabricator.wikimedia.org/T176639#3635520 (10mforns) @mpopov And regarding an alternative that we considered during the sinc-up meeting: Using `db1047.eqiad.wmnet` instead of `analytics-slave.eqiad.wmnet` would work,... [14:17:56] Hi, did anyone notice any problems in connecting from Labs to tools.labsdb recently? One of my Shiny Dashboards cannot connect there since 10 minutes ago or so. Thanks. [14:20:46] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/limn-flow-data] - 10https://gerrit.wikimedia.org/r/380751 (https://phabricator.wikimedia.org/T176639) [14:22:01] Just checked w. #wikimedia-cloud there is definitely a problem with tools.labsdb and they're working to resolve it. [14:22:08] GoranSM: i dont' know why it wouldn't work, but there is also now https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/ [14:22:11] if you haven't yet seen [14:22:13] cool [14:22:13] :) [14:22:30] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/limn-edit-data] - 10https://gerrit.wikimedia.org/r/380755 (https://phabricator.wikimedia.org/T176639) [14:23:19] ottomata: Thanks, I've seen the announcement but didn't check out whether do I need to change the server that I connect to from Labs or not. Will check it out. [14:24:16] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/380756 (https://phabricator.wikimedia.org/T176639) [14:24:20] ottomata: tools.labsdb) will continue to support user created databases and tables. (from the announcement). Anyways, I'm sure they will take care of everything. [14:24:27] ya [14:26:18] ottomata: done! 
[14:26:25] merged and cool? [14:26:40] yep! [14:26:43] gr8 [14:29:59] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/limn-ee-data] - 10https://gerrit.wikimedia.org/r/380757 (https://phabricator.wikimedia.org/T176639) [14:31:36] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/limn-multimedia-data] - 10https://gerrit.wikimedia.org/r/380758 (https://phabricator.wikimedia.org/T176639) [14:32:09] elukey, ottomata: sorry for having missed ops standup [14:32:23] anything I can help with to speed up druid process? [14:32:35] hmm, joal don't think so [14:32:41] ok [14:33:22] ottomata: is cluster: jumbo_kafka final? [14:33:38] (Filippo is asking it to me since it will change how metrics are stored) [14:34:31] jumbo_kafka? [14:34:43] i still don't fully understand what puppet 'cluster' is [14:34:55] the official kafka cluster name is [14:34:57] jumbo-eqiad [14:36:18] I think it is a logical grouping of hosts in puppet, that might reflect on other things.. for example, IIUC ganglia clusters == puppet clusters [14:37:38] ottomata: we also have profile::kafka::broker::kafka_cluster_name: jumbo [14:37:45] not jumbo-eqiad [14:37:55] is it ok/ [14:37:55] ? [14:39:04] yes, because that is passed to the kafka_cluster_name function [14:39:11] which auto suffixes the site [14:39:17] that way you can use the same role in multiple sites [14:39:59] super [14:40:06] so for analytics we currently have cluster: analytics_kafka [14:40:11] yes, but that is a historical problem [14:40:20] and i'm not really sure where that came from (did I make that? i don't know) [14:40:25] what about main? [14:40:56] I didn't find any [14:41:03] no? 
[14:41:13] I mean cluster: naming [14:41:13] also, we don't even include ganglia i think [14:41:18] on these nodes [14:41:30] if !defined(Class['::standard']) { [14:41:30] class { '::standard': [14:41:31] has_ganglia => false [14:41:31] } [14:41:31] } [14:41:35] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/380759 (https://phabricator.wikimedia.org/T176639) [14:41:39] would like to know what this cluster_name is used for [14:41:44] sure sure ganglia was an example [14:42:05] other clusters are like "logstash, cache_text, etc.." [14:43:24] yes but why? what is using cluster name? [14:44:23] elukey: if we had to pick one, maybe I would do [14:44:27] kafka_jumbo-eqiad [14:44:38] or kafka-jumbo-eqiad (not sure which is better there) [14:45:14] not sure if cluster_name wants the DC suffix or not [14:46:08] ahhaha kafka_jumbo-eqiad made my eyes bleed [14:48:32] haha [14:48:35] but, it makes sense! [14:48:38] they are just separator [14:48:40] ss [14:48:46] 'jumbo-eqiad' is the cluster name [14:48:48] kafka_ is a prefix [14:48:58] but yeah, i knew it would make your eyes bleed (mine too a bid) [14:49:02] bit* [14:49:11] !log restart [14:49:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:49:18] !log restart mobile_apps session_metrics bundle [14:49:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:55:18] maybe we can just set kafka-jumbo [14:55:39] I am trying to get if this will be our cluster name in the prometheus dashboard [14:55:51] in that case kafka_jumbo_eqiad or kafka-jumbo-eqiad [14:56:52] 10Analytics-Kanban, 10Patch-For-Review: Replace references to dbstore1002 by db1047 in reportupdater jobs - https://phabricator.wikimedia.org/T176639#3635646 (10mforns) @elukey @mpopov In my opinion, the technical nomenclature master/slave is a really unhappy choice (from decades ago). But nowadays it's used... 
[14:58:46] 10Analytics-Kanban, 10Patch-For-Review: Replace references to dbstore1002 by db1047 in reportupdater jobs - https://phabricator.wikimedia.org/T176639#3632150 (10Milimetric) primary/replica, primary/secondary, master/doctor (HA!!!) are all fine alternatives. As others have pointed out, we might also have to ch... [14:58:58] (03PS1) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/380762 (https://phabricator.wikimedia.org/T176639) [15:01:27] elukey: if you have to, do kafka-jumbo-eqiad, as that will be easier to use with the kafka_cluster_name in puppet [15:01:32] since it uses hyphen jumbo-eqiad [15:10:54] maybe kafka_jumbo is fine? it seems more consistent to other cluster naming across puppet, plus IIUC we'll need to choose the prometheus source as variable in our dashboards (like https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats -> datasource) [15:13:14] ahhh the cluster is also a cumin target: https://puppet-compiler.wmflabs.org/compiler02/8039/kafka-jumbo1001.eqiad.wmnet/ [15:20:32] mforns: batcave-2 for talk about mw-reduced? [15:20:40] joal, ok! [15:32:30] a-team gonna do talk if you want [15:34:17] joal: you wanna? 
[15:34:21] I do [15:34:44] joining ottomata [15:39:04] (03PS1) 10GoranSMilovanovic: Sep 26 2017 Add logo + Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380770 [15:40:28] (03CR) 10GoranSMilovanovic: [V: 032 C: 032] Sep 26 2017 Add logo + Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380770 (owner: 10GoranSMilovanovic) [15:40:34] (03Merged) 10jenkins-bot: Sep 26 2017 Add logo + Crosstabs [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380770 (owner: 10GoranSMilovanovic) [15:57:28] (03CR) 10Bearloga: [C: 031] "Just a heads up that this repo is inactive and that if I had admin rights on it, I would have marked it as read-only when we deprecated it" [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/380762 (https://phabricator.wikimedia.org/T176639) (owner: 10Mforns) [16:02:34] mforns, milimetric: Time for a talk on metrics definition (again)?? [16:02:39] yeeeo [16:02:43] yep [16:02:51] joal: I'm in the manager's thing [16:02:54] oh [16:03:08] you guys got it, I trust you! [16:03:09] milimetric: ok no prob, we'll wait for you to finish [16:03:23] milimetric: That one is bit more tricky, we'll wait [16:03:32] wikimedia/mediawiki-extensions-EventLogging#698 (wmf/1.31.0-wmf.1 - 902a144 : libraryupgrader): The build has errored. [16:03:36] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/compare/wmf/1.31.0-wmf.1 [16:03:36] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/280031142 [16:05:34] mforns elukey: how long is dbstore1002 aka analytics-store.eqiad.wmnet going to keep running? [16:06:35] like, is there an estimated date at which it will be shut down and that CNAME removed? [16:07:04] hi bearloga, I'm not sure... [16:07:46] bearloga, I'd say within next quarter [16:07:59] elukey? 
[16:10:55] so ideally in our plans the log database will stop being replicated on dbstore1002 when the new db1047 replacement will be ready [16:11:06] so dbstore1002 will only replicate wikis [16:11:21] we might keep the analytics-store cname [16:12:58] eventually, hopefully after few of the next quarters, we'll be able to migrate mysql users of the log database to HDFS [16:13:39] only then we'll start talking if it is possible to stop maintaining mysql for the log database (so cleaning up analytics-slave) [16:14:20] (03CR) 10Mforns: "Oh, OK!" [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/380762 (https://phabricator.wikimedia.org/T176639) (owner: 10Mforns) [16:20:07] (03CR) 10Mforns: "@Bearloga here's the puppet patch, added you as a reviewer:" [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/380762 (https://phabricator.wikimedia.org/T176639) (owner: 10Mforns) [16:21:12] (03Abandoned) 10Mforns: Replace references to dbstore1002 by db1047 [analytics/discovery-stats] - 10https://gerrit.wikimedia.org/r/380762 (https://phabricator.wikimedia.org/T176639) (owner: 10Mforns) [16:21:34] Hi! What's the status of https://wikitech.wikimedia.org/wiki/SWAP? Can/should I use it instead of Ipython for some quick CentralNotice-y checks I need to do? Thanks!!! [16:21:54] madhuvishy: ^ [16:22:34] (The doc talks about "plans"...) [16:22:48] Mmm gonna try it NEway :) [16:23:04] AndyRussG: yeah if its to talk to analytics stores, use it [16:23:28] madhuvishy: K fantastic :) thx!! Yep that's what it's for [16:23:29] AndyRussG: https://gist.github.com/madhuvishy/d349c472de1279568534e4fb2b5bf505 [16:24:16] ok, joal, done [16:24:31] I'll go to the cave and maybe chop some veggies for lunch at the same time [16:24:36] give me 4 minutes [16:24:43] madhuvishy: neat! [16:24:45] Nettrom: did you ever get your ldap access squared away? [16:25:29] madhuvishy: yes, I got that the day after I requested it, and since then I’ve been able to access the notebook, thanks! 
[16:25:42] Nettrom: oh perfect :) yw [16:29:11] mforns elukey: got it, okay, thanks [16:29:49] milimetric, mforns: batcve ? [16:30:18] joal, omw [16:41:29] quick question: is there a way to figure out how many requests hit api.php a year ago? [16:42:36] doing this for the last three months is fairly easy via the webrequest table & hive, but there is no historical data beyond that [16:48:25] gwicke: you said it - no historical data beyond that [16:49:48] okay, just double checking that there is no other place to look for this kind of thing [16:49:52] thanks! [16:50:32] np gwicke [16:50:53] sorry for not having answered positively gwicke ) [17:09:07] * elukey off! [17:43:29] joal: hi! how's it going? :) Can I query Druid directly? (I could get the same data from Hive, but if I can get it instantaneously and using less resources from Druid, it seems better) [17:43:43] I recall seeing some doc somewhere, haven't found it again yet [17:43:45] thx in advance! [17:45:46] AndyRussG: yes you can, but it is a json query language [17:46:07] http://druid.io/docs/latest/querying/querying.html [17:46:17] we are in the process of setting up an lvs balanced service, but you can hit any druid broker node [17:46:19] so try [17:46:22] druid1001.eqiad.wmnet [17:46:22] as the hostname [17:46:30] oh and 8082 as the port [17:54:04] 10Analytics, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice: Make banner impression counts available somewhere public - https://phabricator.wikimedia.org/T115042#1713344 (10ksmith) @Nuria : Have you read the [[ https://docs.google.com/document/d/1R3G04PCe3xZAR2azWPzdWVO4vQQKraZUrlOMQidfh0w/edit... [18:01:00] ottomata: cool fantastic! aaaaaand... Caaaan I do it from SWAP/Jupyter? /me writes letter to Santa, folds it into a paper airplane, and flies to the North Pole [18:01:44] or, has anyone tried to do it from SWAP/Jupyter yet? 
[18:02:53] AndyRussG: i think you should be able to [18:03:06] but i don't know of anyone who has [18:03:08] if you do, i would LOVE to see it [18:03:12] that would be super cool [18:06:11] ottomata: cool, will do! :) [18:39:19] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 62.71% of data above the critical threshold [3891.2] [18:43:06] hello an1001 [18:43:15] Arf :( [18:43:37] ottomata: looks like I can't connect from notebook1001 to druid1001 on port 8082 / expected? [18:43:45] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&from=now-24h&to=now [18:43:59] something seems to have changed in metrics from ~17:30 UTC onwards [18:44:10] nothing super horrible but enough to trip alarms [18:44:35] elukey: Huge job running - Might be that? [18:46:17] could be yeah [18:46:26] hm [18:46:47] gwicke: by the way, not sure if you're aware - the hive queries you run are huge [18:46:48] joal: not expected [18:46:51] i can make that connection [18:47:24] joal ya it works fine for me from CLI [18:47:33] ottomata: Will try again [18:47:45] joal: yeah, unfortunately we don't have live metrics for the action API [18:48:21] or filtered logs for the api only [18:48:58] gwicke: sure, but not a good reason to make huge queries without letting us know first :) [18:49:05] I'm only looking at the september numbers for action api vs. rest api [18:49:32] gwicke: other way to think of it is doing daily queries instead of requesting september data multiple times over the month [18:50:00] yeah, we have that for the rest api, but not the action api [18:50:30] other question for you gwicke: /api/rest_v1/% is used in both text and upload ? [18:51:16] I think it's pretty much all text [18:51:21] err, sorry [18:51:27] rest api is text only [18:51:45] not 100% sure about api.php [18:51:58] gwicke: the query you run now is "uri_path like /api/rest_v1/%" [18:52:14] yep [18:52:35] is there a way to limit queries to the text cluster? 
asking that gwicke because we have a partition by webrequest_source (text, upload, misc) - half of it is text, the other one upload (misc is almost nothin) [18:53:10] If you can add a filtering clause "AND webrequest_source = 'text'" you'll drop the query size by 2 [18:53:15] gwicke: --^ [18:53:15] so basically just add something like webrequest_source = "text" ? [18:53:21] indeed [18:53:34] That would be a first and easy improvement :) [18:53:55] I'd also reduce the scope of the query and not use the whole month [18:53:59] happy to add that, but the current query is at 78% [18:54:02] worth canceling? [18:54:07] gwicke: nope [18:54:21] okay [18:54:36] gwicke: however I second elukey : run queries every week on one week of data, it'll be faster on your side, and easier for the cluster :) [18:54:48] not using the whole month means that we'd need to set up some kind of cron tool [18:55:01] and then aggregate the daily counts [18:55:10] (or weekly) [18:55:14] gwicke: We use oozie for that, and other teams also use reportupdater [18:55:23] gwicke: I think it's easy [18:55:43] gwicke: lastly, I hope you won't run a full month query when september ends for real :) [18:55:54] are there docs on how to set that up? [18:56:22] gwicke: oozie is not super easy, I'd go for reportupdater [18:56:23] the current queries are ad-hoc, for some slides Victoria is preparing [18:56:35] ideally, we'd have action api metrics every couple minutes [18:57:05] that would avoid the need for the manual querying, and would also give us historical data beyond three months, similar to the rest api [18:58:18] gwicke: what would be the issue of aggregate numbers in daily/weekly counts? 
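The week-by-week (or day-by-day) approach suggested here is easy to script: emit one partition-pruned query per day and sum the counts afterwards. A sketch; the `wmf.webrequest` table name and the `year`/`month`/`day` partition columns are assumptions based on the log, and the generated SQL is only printed, not executed:

```python
from datetime import date, timedelta

# One partition-pruned Hive query per day; webrequest_source = 'text'
# halves the scanned data, as discussed above.
QUERY = """
SELECT COUNT(*) FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = {d.year} AND month = {d.month} AND day = {d.day}
  AND uri_path LIKE '/api/rest_v1/%'
""".strip()


def daily_queries(start, end):
    """Yield one query string per day in [start, end]."""
    d = start
    while d <= end:
        yield QUERY.format(d=d)
        d += timedelta(days=1)


if __name__ == "__main__":
    # Running these sequentially (one hive invocation per day) touches the
    # same data overall but avoids one huge burst of containers.
    for q in daily_queries(date(2017, 9, 1), date(2017, 9, 3)):
        print(q, end="\n\n")
```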
[18:58:24] gwicke: I think what you're after is exactly what you have in graphite (from the job I see running every hour on the cluster) [18:58:29] (I am trying to understand the issue) [18:58:51] joal: yeah, but that's only for the rest api [18:59:00] we'd need the same for the action api [18:59:02] so that we can compare [18:59:07] gwicke: Isn't that what you are currently querying ? [18:59:13] Ah [18:59:20] I'm querying both rest & action, sequentially [18:59:37] to make sure that the methodology is comparable [19:00:15] gwicke: Given the size of the query (you read ~35Tb of data), getting rest results out of graphite is certainly less easy but also less costly in terms of computation [19:00:20] hm [19:00:32] I don't fully trust the rough "multiply avg request rate from graphite" thing, especially when we do that only for one of the numbers [19:00:37] makes sense, but really 2 queries over 1 month of webrequest is not nice for the cluster [19:01:10] (joal could we add tags for those two things? if we don't already?) [19:01:20] ottomata: Yes, we will [19:01:21] * elukey off again [19:01:37] ottomata: aggregation will still be needed post-tagging, but yes, we will [19:01:43] joal: yeah, I'll definitely add the webrequest (or webrequest_source?) clause [19:02:09] also, is there a way to count both in a single pass? [19:02:13] gwicke: thank you [19:02:18] gwicke: There is [19:02:55] also please query less data in one go, it is important [19:03:06] we are currently getting alarms firing [19:03:26] hmm, from a single sequential query? [19:04:06] gwicke: webrequest text only, for both action and rest? [19:04:21] is there something you can tune to avoid that setting off alarms? 
[19:04:43] gwicke: We have alarms, but we're not sure it was triggered by your job [19:04:50] joal: yes [19:05:19] There currently are multiple big jobs running - it might be the all-of-them together triggering [19:05:37] gwicke: last time you ran those types of queries it didn't fire [19:05:40] gwicke: sure there is, people collaborating and not making huge queries. I mean, what is really the problem of not querying an entire month of webrequest data? [19:05:55] so I'm assuming you're not the culprit by yourself :) [19:06:18] joal: this time there are multiple big jobs, but gwicke's query is really big and by far the most resource consuming [19:06:49] again, we support all the users' needs, we just ask to avoid querying too much data [19:06:53] in one go [19:06:59] elukey: as far as I understand it, that would only really help if data was stored & incrementally rolled up [19:07:37] just doing a loop from 1 to 31 in a shell would still touch the same data [19:08:18] gwicke: doing it sequentially would prevent bursts somewhat [19:08:30] gwicke: but yes, we'd rather do it incrementally [19:08:42] gwicke: it wouldn't use all the containers that are currently allocated at once [19:08:48] but in small batches [19:09:07] so if the cluster is already busy, it will not get into the alarms warning zone [19:09:21] and it wouldn't change much your results [19:09:45] alternatively, is it possible to tell hadoop to be smarter about parallelism? [19:09:51] .... [19:10:49] need to go now sorry! Joseph let me know later on if the issue is still ongoing, I'll check after dinner (need to go out with some friends now) [19:10:58] in the meantime, my query is all done [19:12:20] good gwicke [19:12:22] hm, the alert was for JVM namenode heap size, which is interesting, and does not usually happen just for large queries [19:12:29] What's the pattern for rest_action ? 
the cluster is smart about parallelism and assigning workers [19:12:40] you could try using the 'nice' queue [19:12:45] but i'm not totally sure if that would help or not [19:12:46] ottomata: look at all the other metrics, they are all spiking from 17:30 UTC [19:13:07] anyhow, will check tomorrow :) [19:13:08] byyyeee [19:13:08] joal: /api/rest_v1/% [19:13:28] gwicke: this one is rest, action? [19:13:47] rest_v1 ;) [19:14:04] meeeeeh gwicke :) [19:14:27] * joal no understanding [19:14:56] rest [19:15:06] action is /w/api.php% [19:15:23] ok gwicke [19:17:13] gwicke: https://gist.github.com/jobar/f371ef9179699c41bf86fdbe1c493b9e [19:17:22] gwicke: this gives you both results in one pass [19:18:01] nice, thank you! [19:18:13] gwicke: we plan to have smaller datasets to query for what you wish soon (we'll tag rows in webrequest and split the data) [19:18:47] gwicke: the last part will then be to add final aggregations in order to be able to keep data long term [19:19:17] updated https://wikitech.wikimedia.org/wiki/RESTBase#Analytics_and_metrics [19:20:00] nod, that would be nice [19:20:53] would it be hard to add a graphite metric for /w/api.php% requests, possibly even in the same job as the rest api one? [19:20:57] gwicke: do you have results for rest_v1 for august? [19:21:04] gwicke: super easy [19:21:20] that one runs every 30 minutes already [19:21:34] gwicke: we already have a job doing so for rest_v1, adding action is easy [19:21:53] gwicke: would you have the number of requests done to rest_v1 in august? 
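The gist linked above ("both results in one pass") isn't reproduced in the log, but the standard way to count two URI patterns in a single scan is conditional aggregation; a sketch of the idea over an in-memory sample:

```python
# In HiveQL the one-pass pattern would look roughly like (an assumption,
# not necessarily what the gist contains):
#   SELECT SUM(IF(uri_path LIKE '/api/rest_v1/%', 1, 0)) AS rest,
#          SUM(IF(uri_path LIKE '/w/api.php%', 1, 0))    AS action
#   FROM wmf.webrequest WHERE ...
# The same logic in plain Python:

def count_apis(uri_paths):
    """Count REST and action API requests in one pass over the paths."""
    rest = action = 0
    for p in uri_paths:
        if p.startswith("/api/rest_v1/"):
            rest += 1
        elif p.startswith("/w/api.php"):
            action += 1
    return rest, action


sample = [
    "/api/rest_v1/page/html/Foo",
    "/w/api.php",
    "/wiki/Main_Page",
    "/api/rest_v1/page/summary/Bar",
]
print(count_apis(sample))  # one scan of the data yields both counts
```

The point is that both totals come out of a single read of the 35 TB of webrequest data instead of two sequential full scans.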
[19:22:09] and powers the varnish part of https://grafana-admin.wikimedia.org/dashboard/db/api-summary?panelId=1&fullscreen&orgId=1 [19:22:22] approximately, from graphite [19:22:27] via avg [19:22:39] I think we could have exact numbers [19:23:11] if we had live metrics for both action & rest api, I think that would be good enough for our purposes [19:23:26] any small errors like packet loss would affect both metrics [19:25:13] gwicke: I have monthly numbers from graphite for rest_v1 using summarize [19:26:54] so far, I have 14.3 bn from hive, ~14.9bn via graphite avg [19:27:38] gwicke: for september 01 to 26? [19:27:47] yeah [19:27:55] to now [19:27:57] in both cases [19:29:07] 10Analytics: Add action api counts to graphite-restbase job - https://phabricator.wikimedia.org/T176785#3636597 (10JAllemandou) [19:29:13] gwicke: --^ [19:29:29] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3686.4] [19:29:40] gwicke: --^ as well [19:30:12] joal: cool, thank you for your help! [19:30:47] gwicke: np, sorry for bothering, it was really big ! [19:30:54] my graphite number gwicke : 14293820969 [19:31:06] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jul-Sep 2017): Fix duplicated enrollments in database - https://phabricator.wikimedia.org/T176786#3636616 (10Aklapper) [19:31:19] joal: that's a lot closer [19:31:35] just summarize(metricname, "1month")? [19:31:46] gwicke: correct :) [19:31:50] gwicke: single stat mode [19:31:54] to get a single number [19:32:11] And, from=true [19:32:27] actually gwicke :summarize(metricname, "1month", sum, true)? 
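Graphite's `summarize(series, "1month", "sum", true)` re-buckets datapoints into fixed windows aligned to the start of the series and sums each window. A simplified sketch of that behavior (real Graphite also handles calendar-style intervals and missing values; the fixed bucket width in seconds here is a simplification):

```python
def summarize_sum(points, bucket_seconds):
    """Re-bucket (timestamp, value) points into fixed windows and sum them.

    Mimics summarize(..., 'sum', alignToFrom=true): buckets are measured
    from the first point's timestamp rather than from epoch boundaries.
    """
    if not points:
        return []
    start = points[0][0]  # alignToFrom: windows start at the first point
    buckets = {}
    for ts, value in points:
        key = (ts - start) // bucket_seconds
        buckets[key] = buckets.get(key, 0) + value
    return [(start + k * bucket_seconds, v) for k, v in sorted(buckets.items())]


hourly = [(3600 * i, 10) for i in range(48)]  # 48 hourly points of value 10
print(summarize_sum(hourly, 86400))           # two daily sums of 240 each
```

This is why the single-stat value differs depending on the aggregation: a dashboard panel showing "avg" averages the raw points, while a summarize-with-sum over the whole interval gives the total count.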
[19:34:18] ahh, yes [19:34:50] in a dashboard it shows up as "avg" [19:34:53] ;) [19:35:09] Didn't know gwicke :) [19:37:10] ottomata: I get a 503 querying druid from paws [19:37:25] ottomata: I think it's because it runs inside a docker [19:39:41] that is probably it [19:39:46] isolated network stuff [19:39:51] dunno how madhu makes that magic work [19:39:52] yup [19:40:02] she gets it to connect to hive though... [19:40:04] ottomata: I have no idea [19:40:09] ottomata: very true ! [19:41:08] hmmmm joal can you try from notebook1002 [19:41:13] ? [19:41:15] ottomata: maybe there is some special connection [19:41:18] they have slightly different configs [19:41:22] and it looks like only 1002 has a hadoop client [19:42:47] ottomata: I can access hive from notebook1001 [19:43:07] ottomata: issue is connecting to http://druid1001.eqiad.wmnet:8082 [19:48:40] huh weird [19:48:58] it must be configured in the paws internal wheels deploy? [19:49:01] i don't see anything in puppet for it [19:51:47] weirdoh [19:55:47] (03PS7) 10Joal: Update mediawiki-history-reduced oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/379000 (https://phabricator.wikimedia.org/T174174) [19:55:56] mforns, milimetric --^ if you have a minute [19:56:17] I updated the docs and example queries, and also added a paragraph for us to remember having to add deletion drift [19:56:22] cd ../../aqs [19:56:24] oops [19:56:36] Directory not found [19:57:21] LOL milimetric [20:00:54] and now the aqs one as expected :) [20:01:00] (03PS10) 10Joal: Add mediawiki-history-metrics endpoints [analytics/aqs] - 10https://gerrit.wikimedia.org/r/379227 (https://phabricator.wikimedia.org/T175805) [21:19:34] 10Analytics-Tech-community-metrics, 10Developer-Relations (Jul-Sep 2017): Fix duplicated enrollments in database - https://phabricator.wikimedia.org/T176786#3637294 (10Aklapper) In theory done: ``` To https://github.com/Bitergia/mediawiki-identities.git f44cdb9..50e1ccc master -> master ``` In practice 
let... [22:12:19] PROBLEM - HDFS active Namenode JVM Heap usage on analytics1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [3891.2] [22:21:29] RECOVERY - HDFS active Namenode JVM Heap usage on analytics1001 is OK: OK: Less than 60.00% above the threshold [3686.4] [22:49:53] (03PS1) 10GoranSMilovanovic: Fix Tabs/Crosstabs Init [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380892 [22:50:11] (03CR) 10GoranSMilovanovic: [V: 032 C: 032] Fix Tabs/Crosstabs Init [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380892 (owner: 10GoranSMilovanovic) [22:50:17] (03Merged) 10jenkins-bot: Fix Tabs/Crosstabs Init [analytics/wmde/WDCM] - 10https://gerrit.wikimedia.org/r/380892 (owner: 10GoranSMilovanovic)