[00:50:43] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [04:15:25] (03CR) 10Joal: [C: 04-1] Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) (owner: 10Ottomata) [06:26:25] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10elukey) Yep! Looks good, my only suggestion is/was to avoid having multiple instances of superset using the db at the same time to avoid "multi-writes" scenarios. One question for... [06:26:38] bonjour [06:51:13] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [06:57:11] !log restart presto on an-presto1003 since all the memory on the host was occupied, and puppet failed to run [06:57:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:00:51] I am reducing the heap size of the workers to 100g [07:04:53] !log roll restart presto cluster to pick up new jvm xmx settings [07:04:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:10:24] Good morning [08:13:21] o/ [08:14:34] question regarding event gate, when adding a new stream how/when do the tables in hive get created? 
[08:16:25] Hi dcausse - Normally tables are automagically created in hive upon data hitting the cluster [08:16:31] 10Analytics-Kanban: Test the Bigtop 1.5 RC release on the Hadoop test cluster - https://phabricator.wikimedia.org/T269919 (10elukey) p:05Triage→03High [08:17:08] joal: ok nice, thanks! [08:21:43] joal: yesterday I had a very interesting chat about a new cassandra feature in 3.11 with Eric [08:21:49] AH? [08:22:30] with the new version, to better support JBOD, they added dedicated compactor to each of the data dir created (a sort of consistent hashing for sstables) [08:23:00] I understand all the words, but not yet the implications :) [08:23:18] so say simply creating two data dirs instead of one (for the same instance) forces cassandra to split sstables, and hence everything around them (compaction, lookup, etc..) [08:23:55] so levelled compaction for example should be more efficient with two data dir [08:24:31] Ah! I think I get it now - With folder-dedicated compactors, you can be performant over 2 folders mounted on different disks [08:24:43] exactly yes [08:24:49] Nice! [08:25:05] the main issue is that system tables do not work with jbod, there was a bug, that should be fixed with 4.x [08:25:16] so we cannot go from raid to jbod [08:25:25] Arf - ok [08:25:31] buuut we can add say two data dirs and see how levelled compaction behaves [08:26:07] elukey: without disks <-> folders link, I'm not sure if we'd get perf enhancement [08:26:24] IIRC compaction already pushes the load of the hardware quite high [08:27:19] sure but we could try, splitting one big compaction into two/three/etc.. parallel ones might be less aggressive overall [08:27:28] them (compaction, lookup, etc..)
[08:27:28] 09:23:55 <@elukey> so levelled compaction for example should be more efficient with two data dir [08:27:32] woops [08:27:39] sorry :) [08:28:08] we'll have a new cluster so it is a good time to experiment :) [08:28:24] possible elukey - From the host metrics I have the feeling that compaction is already parallelized, but maybe forcing it to happen to separate folders can help [08:28:28] sure [08:30:05] also I didn't get this new feature completely before (I think Eric mentioned it), we could have bought a different set of nodes [08:30:28] like more close to what we have for hadoop (flexbay for root and separate disks for jbod) [08:31:39] elukey: And with that, would we have been able to ask cassandra to use root for system tables? [08:34:30] joal: nono but we could have thought about cassandra 4.x, Eric said that it is not that far from 3.11 feature-wise, so we could have thought about jumping to the new one (that correctly handles system tables with jbod) [08:34:46] ack! [08:34:48] I get it [08:34:59] or fallback to a raid with all the disks [08:35:02] like we do now [08:36:04] ahh but the next week's on-call person is Joseph! 
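The JBOD experiment discussed above comes down to listing more than one entry under `data_file_directories` in cassandra.yaml. A fragment with invented paths (since Cassandra 3.x, SSTables are split by token range across the listed directories, and each directory is flushed and compacted independently):

```yaml
# cassandra.yaml fragment -- paths are invented for illustration.
# With two entries, SSTables are partitioned by token range across the
# directories, so compactions run per-directory instead of one big one.
data_file_directories:
    - /srv/cassandra/data_a
    - /srv/cassandra/data_b
```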
[08:36:19] * elukey brace yourself, oozie restarts are coming :D [08:38:24] You couldn't be more right elukey :D [09:17:56] (03PS1) 10Itamar Givon: Sanitize and keep WikibasePingback events [analytics/refinery] - 10https://gerrit.wikimedia.org/r/648139 (https://phabricator.wikimedia.org/T269918) [10:33:10] 10Analytics: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10elukey) [10:34:10] (03PS2) 10Lucas Werkmeister (WMDE): Add script to collect lexicographical data statistics [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/647284 [10:34:54] 10Analytics: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10elukey) [10:36:28] 10Analytics: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10elukey) [11:21:02] 10Analytics, 10Patch-For-Review: Update Spicerack cookbooks to follow the new class API conventions - https://phabricator.wikimedia.org/T269925 (10elukey) [11:35:30] * elukey lunch! [13:22:20] Hey folks! Question about Turnilo cubes, specifically pageviews_daily; I'm looking for a quick snapshot of pageviews from Nov. 30 - current but it seems like December data hasn't been populated yet. When does this get updated? [13:23:15] Hi eyener - pageview_daily gets updated monthly so you won't get it with Dec. pageviews before beginning of january [13:23:32] eyener: for recent updates you can use pageview_hourly [13:24:01] eyener: data is hourly instead of daily (as the name states), and it contains 3 months only [13:24:48] Super, joal, thank you! Looking now [13:30:19] Hello hello, me again with more greenhorn questions :) I'm trying to log in to hue, and I keep getting this error `java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient` Is there anythin I can/need to do about it?
[13:30:39] **anything [13:31:03] itamarWMDE: This thing rings a bell :S [13:31:07] I can't recall [13:31:41] Oh - I think it could be related to cookies - Could you clear your cookies for hue.wikimedia.org and retry please itamarWMDE ? [13:31:59] sure [13:34:01] (03PS2) 10Andrew-WMDE: [WIP] Process EventLogging events for VisualEditor [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/647742 (https://phabricator.wikimedia.org/T262209) [13:35:26] that did the trick in terms of loading the dbs, but now I get `Error while compiling statement: FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient` when trying to run my query :/ [13:35:41] MEH [13:35:41] (so the same error, but wrapped in another one) [13:38:08] Wait, I might've been a bit of a dunce, when I run older queries from history, they seem to work [13:38:18] Ah? [13:38:31] :O [13:38:37] I added a semicolon.... [13:38:54] so the syntax is indeed incorrect [13:40:14] itamarWMDE: hue queries seem to work for me : [13:41:09] Something odd is happening on my side though...I'll try to figure it out, thanks for your help joal :) [13:41:27] np itamarWMDE - Your problem rings a bell but I can't really recall :S [13:49:36] This worked, had to switch to Hue 3 (couldn't find the session option in Hue 4): [Hive query errors with Kerberos](https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hue) [13:50:10] That was it itamarWMDE! Thanks for helping my poor memory :) [13:51:51] it is a hue bug :( [13:52:05] that reminds me about hue-next, it is really a mess [14:29:10] 10Analytics, 10Analytics-Kanban: AQS should be more resilient to druid nodes not available - https://phabricator.wikimedia.org/T268811 (10Pchelolo) In the end requests passed to hyperswitch are passed on to https://github.com/wikimedia/preq library, which has `timeout` and `retries` options: https://github.com... [14:30:18] Pchelolo: thanks a lot!
[14:58:27] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10jlinehan) [15:03:49] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10jlinehan) [15:09:09] (03CR) 10Ladsgroup: [C: 03+2] Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/647282 (owner: 10Lucas Werkmeister (WMDE)) [15:09:25] (03PS1) 10Ladsgroup: Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/648266 [15:09:39] (03CR) 10Ladsgroup: [C: 03+2] Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/648266 (owner: 10Ladsgroup) [15:10:43] (03Merged) 10jenkins-bot: Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/647282 (owner: 10Lucas Werkmeister (WMDE)) [15:11:03] Amir1: no puppet changes but analytics-related stuff?? :( [15:11:18] (03Merged) 10jenkins-bot: Update MediaWiki CodeSniffer to version 34.0.0 [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/648266 (owner: 10Ladsgroup) [15:11:28] haha, I will make one, this week I had some distractions :( [15:11:30] sorry [15:11:40] ahahahah please don't say that, I am kidding [15:13:27] don't worry, I will happily ruin your weekend [15:31:01] joal: on hadoop test there is 2.10 :) [15:31:18] it is not finalized, I want to rollback and deploy again [15:31:31] but it looks good, I don't see anything horribly on fire [15:32:19] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10Ottomata) Haha "should this be capitalize"...no! :) I'd like to make a suggestion. For app specific event schemas, prefix with the app name: - analytics/medi... 
[15:35:04] elukey: sorry about hue-next, I know I dropped that on the floor and ran away. Do we schedule some time to pair next week? [15:37:10] milimetric: sure! But it is not your fault, the new version is really.. disappointing [15:37:22] I forced myself to use it during ops week and it is a nightmare [15:37:34] but I don't have any alternative in mind [15:37:47] yeah, that was my code smell right away. I wasn't exaggerating eh? :/ [15:38:02] Switch from oozie sooner? [15:38:21] with the effort we wasted here we might as well cut our losses [15:39:24] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10jlinehan) >>! In T269936#6684966, @Ottomata wrote: > For app specific event schemas, prefix with the app name: > - analytics/mediawiki/mediasearch_interaction >... [15:48:06] milimetric: yes that would be idea.. [15:48:08] *ideal [15:53:34] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10Ottomata) [15:55:28] heya ottomata, let's finish the migration later today? [15:58:43] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10Mholloway) So they're linked here, here are the recommendations from Product Analytics that I think this task is at least partly in response to: https://www.medi... [16:11:04] mforns: o/ [16:11:12] holooloo [16:11:48] the other day there was a question about https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater#How_to_test, should we update it with some instructions for say stat100x hosts? Like how to test etc.. or maybe add a note that an-launcher is only for analytics-team members [16:11:53] what do you think? [16:13:22] elukey: yes, I think it's best to use a stats machine to test no?
[16:13:36] at least say so in the docs [16:13:52] there's nothing that prevents us from using them to test RU right? [16:14:17] mforns: so on those we don't have the analytics user keytab, that is the main problem (if we as team want to test jobs) [16:15:26] elukey: but RU doesn't usually write to Hive [16:15:38] theoretically it shouldn't [16:15:54] the output reports are written locally on the users folder [16:17:29] ah yes yes hive is not needed then stat100x are super fine [16:18:40] elukey: would RU be able to access the wiki database replicas? do they have /etc/mysql/conf.d/stats-research-client.cnf ? [16:18:48] I mean from stat boxes [16:18:50] checking [16:19:08] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10razzi) @elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached. Currently the presto dashboards are not loading on staging due... [16:20:43] hm elukey the stat boxes do not have the creds file [16:21:15] mforns: they should have the analytics-privatedata ones for sure [16:21:56] elukey: but then, one must execute RU as analytics or there will be a permission error when passing the creds file? [16:27:26] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10elukey) >>! In T268219#6685049, @razzi wrote: > @elukey: I confirmed that memcached was working based on the presence of superset_result keys in memcached. Perfect, this is great!... [16:30:37] mforns: nono there are creds for people in analytics-privatedata-users [16:30:57] ah ok!
[16:31:02] elukey@stat1004:~$ ls -l /etc/mysql/conf.d/analytics-research-client.cnf [16:31:06] -r--r----- 1 root analytics-privatedata-users 120 Oct 13 08:40 /etc/mysql/conf.d/analytics-research-client.cnf [16:31:09] elukey: then will update the docs [16:31:09] there is also the research-client.cnf in there [16:31:15] but it will be deprecated :) [16:31:18] super [16:31:19] ok [16:31:24] :] [16:31:32] <3 thanks for the docs [16:34:52] (03CR) 10Mforns: [C: 04-1] "LGTM Overall! Left a couple comments on the config.yaml file." (033 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/645345 (https://phabricator.wikimedia.org/T260138) (owner: 10Andrew-WMDE) [16:35:41] elukey: could you help me load superset_production into superset_staging? Could be that easy as mysqldump, rsync, and load the file? [16:36:19] razzi: morning! yes for sure! So the staging db is on the same mariadb instance on an-coord1001, so it doesn't even need an rsync [16:36:39] ah nice [16:36:48] but the procedure is a little delicate, so it must be done with care, I can write some docs later on [16:37:13] so you can review, tell me what it is not clear, and then execut it [16:37:16] *execute [16:40:39] 10Analytics, 10Analytics-Kanban: AQS should be more resilient to druid nodes not available - https://phabricator.wikimedia.org/T268811 (10fdans) ohhh this is great to know @Pchelolo I was about to javascript my way through these two elements :) [16:44:17] elukey: ok, I'll keep an eye out for the docs. How did you fix the ssl cert for presto by the way? [17:02:38] razzi: ah so in the "databases" config (you can access that via the UI) there is a presto-analytics config, and if you open it at some point there is a json with some extra sqlalchemy settings [17:02:47] and one of them is related to the TLS CA to trust [17:03:30] ok cool, all through the ui I see [17:04:49] yep yep [17:04:53] (that gets saved into the db) [17:05:05] going to run a little errand, bbiab in a few! 
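The sync razzi asks about matches what elukey describes: staging and production live on the same MariaDB instance (an-coord1001), so no rsync is needed. A dry-run sketch of the three steps; the database names, dump path, and the `run` wrapper are illustrative assumptions, not the procedure elukey later documents on Wikitech:

```shell
#!/bin/sh
# Sketch of refreshing superset_staging from superset_production on the
# same MariaDB instance. DRY_RUN defaults to 1 so each step is printed
# for review instead of executed.
PROD_DB=superset_production
STAGING_DB=superset_staging
DUMP=/tmp/${PROD_DB}.sql

run() {
    # Print the command in dry-run mode; otherwise execute it.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "DRY-RUN: $1"
    else
        sh -c "$1"
    fi
}

# 1) Dump production (schema + data).
run "mysqldump ${PROD_DB} > ${DUMP}"
# 2) Recreate staging so no stale tables survive.
run "mysql -e 'DROP DATABASE IF EXISTS ${STAGING_DB}; CREATE DATABASE ${STAGING_DB}'"
# 3) Load the dump into staging -- same instance, no file transfer needed.
run "mysql ${STAGING_DB} < ${DUMP}"
```

Only run with `DRY_RUN=0` after reviewing the printed commands; dropping and recreating the staging database is the delicate part elukey mentions.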
[17:14:22] 10Analytics, 10Product-Analytics, 10Product-Infrastructure-Data: Schema repository structure, naming - https://phabricator.wikimedia.org/T269936 (10Ottomata) > +1 in principle. That said, probably most of the instruments are going to be implemented somewhere in MediaWiki core or extensions. Do all of those l... [17:34:24] (03PS2) 10Fdans: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) [17:51:57] so as far as I can see, if I rollback from 2.10 to 2.8 the namenode automatically picks up the correct rollback procedure to do, even if I don't see traces of it in the logs [17:53:32] quick gutcheck: is it appropriate to use this process (https://wikitech.wikimedia.org/wiki/Analytics/Web_publication) for moving large (can-be-made-public) files from stat100x to Cloud VPS instances and then just deleting the file when the transfer is complete? idea being to avoid having to scp the file through my local laptop to move it [17:54:24] isaacj: in theory yes, but it depends how large :D [17:54:31] some GBs is fine [17:56:06] razzi: did you see the task that I opened about cookbook? Could be a nice coding thing to do [17:57:16] elukey: yeah, I can give the first one a go, sre.aqs.roll-restart [17:58:25] razzi: don't feel that you need to do them now, next week or even later is fine, is to give you a break from ops-only tasks :) [18:07:02] thanks elukey ! yeah, this is in the range of 10GB... [18:07:10] then it should be ok :) [18:07:20] please don't upload 10TBs :D [18:08:03] ok, yay! yeah, no worries :) [18:08:22] loading up my test cluster on Cloud VPS ;) [18:48:45] joal: bigtop 1.5 is running on hadoop test, all good afaics! [18:53:27] ahh camus fails [19:01:54] razzi: ok if we drop/restore the db on monday? Or is it blocking you? [19:17:05] elukey: could we give it a try today? 
Should be ok if we break the staging database (though it is my understanding that both staging and production databases are hosted by the same database server, so perhaps even staging is risky) [19:18:16] razzi: it is a little late in here this is why I am asking, I can quickly do it but the complete explanation might require a little bit (including docs etc..) [19:20:05] elukey: ok, no rush :) [19:20:05] Good to respect the late Friday [19:30:27] !log now ingesting Growth EventLogging schemas using event platform refine job; they are exclude-listed from eventlogging-processor. - T267333 [19:30:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:30:33] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [19:30:39] razzi: so I added https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset#Sync_the_staging_database_with_the_production_one [19:31:14] if you want to check it then we can do it together on monday [19:31:39] (you driving and me double checking via meet) [19:31:44] elukey: cool. Sounds good [19:38:28] ack :) [19:39:21] * elukey afk! have a good weekend folks :) [20:25:18] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint): HomepageVisit schema validation errors - https://phabricator.wikimedia.org/T269966 (10nettrom_WMF) I'm adding Analytics Engineering to this task, although there's nothing specific here for them to do. Instead, I'm wondering if the sudden drop... [20:26:10] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10mforns) Hey all! The following schemas have been migrated successfully. * NewcomerTask * HomepageModule * HelpPane... 
[20:43:50] (03PS4) 10Ottomata: Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) [20:44:27] (03CR) 10Ottomata: Move pageview filters to PageviewDefinition; add Webrequest.isWMFHostname (033 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/646808 (https://phabricator.wikimedia.org/T256674) (owner: 10Ottomata) [21:45:11] (03PS2) 10Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) [21:50:28] (03PS3) 10Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) [21:54:18] (03CR) 10Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) (owner: 10Ottomata) [21:56:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Refine should report about malformed records and continue if possible - https://phabricator.wikimedia.org/T266872 (10Ottomata) Ok, in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/647092 I've made logging and error reporting about corrupt re... [21:57:30] (03CR) 10Ottomata: "I haven't yet run this exact code, but will soon. ready for review as is." 
[analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) (owner: 10Ottomata) [22:07:10] PROBLEM - Check the last execution of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [22:08:14] (03PS4) 10Ottomata: Refine using PERMISSIVE mode and log more info about corrupt records [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/647092 (https://phabricator.wikimedia.org/T266872) [22:17:46] RECOVERY - Check the last execution of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:36:02] (03PS1) 10Fdans: Wikistats testing framework: Replace Karma with Rest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 [23:36:28] (03PS2) 10Fdans: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 [23:37:08] (03CR) 10jerkins-bot: [V: 04-1] Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans)