[00:08:38] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (Dzahn)
[00:09:06] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (Dzahn) After i ran apt-get clean it's back to: /dev/md0 46G 39G 5.1G 89% /
[02:36:48] Analytics, Analytics-EventLogging, Analytics-Kanban, MediaWiki-Vagrant: How to use Wikipedia EventLogging schemas in Vagrant setup? - https://phabricator.wikimedia.org/T153641 (Milimetric) TL;DR; It looks like the eventlogging vagrant install is in a funky state because of a distribution upgrade...
[04:20:02] Analytics, Analytics-Kanban, DBA, Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (Bstorm) > > I don't think that there honestly is a Cloud wide use case for these tables until we have a soluti...
[07:07:45] morning!
[07:07:52] so the hdfs balancer is still running
[07:08:15] but I have the same doubt as yesterday, namely if it will ever finish
[07:09:39] going to redo the calculations since some data has been dropped during the past day
[07:09:51] so atm the overall usage is ~48%
[07:10:52] that means that the balancer, if it calculates its threshold dynamically after each iteration, is trying to keep each datanode's usage between 38% and 58%
[07:11:47] there's only one worker left outside the window
[07:11:55] so it might finish soonish
[07:18:07] and I can see that the balancer is now pushing data to that node
[07:37:47] !log manually stopped hdfs-balancer (cluster already balanced, only one host left with some blocks to get) to ease the decom of two more nodes
[07:37:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:38:06] now the main question is if we want to let the balancer run during the weekend or not...
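Editor's note: the window arithmetic from the 07:09-07:11 messages can be sketched as below. This is a minimal illustration, not the balancer's actual code. It assumes a threshold of 10 percentage points (inferred from the 38%-58% figures around ~48% usage), and the node names and per-node usage values are hypothetical.

```python
def balancer_window(cluster_usage_pct, threshold_pct=10.0):
    """Return the (low, high) usage band the balancer aims for.

    threshold_pct=10.0 is an assumption inferred from the 38%-58%
    window quoted in the log for ~48% overall usage.
    """
    low = max(0.0, cluster_usage_pct - threshold_pct)
    high = min(100.0, cluster_usage_pct + threshold_pct)
    return low, high


def nodes_outside_window(node_usage_pct, cluster_usage_pct, threshold_pct=10.0):
    """Return the nodes whose usage falls outside the target band."""
    low, high = balancer_window(cluster_usage_pct, threshold_pct)
    return {
        name: usage
        for name, usage in node_usage_pct.items()
        if usage < low or usage > high
    }


if __name__ == "__main__":
    # Hypothetical per-node usage; only the 48% overall figure is from the log.
    example_nodes = {"worker-a": 47.5, "worker-b": 52.0, "worker-c": 31.0}
    print(balancer_window(48.0))
    print(nodes_outside_window(example_nodes, 48.0))
```

With these made-up numbers only `worker-c` sits outside the band, which mirrors the "only one worker left outside the window" situation described above.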
[07:39:01] !log decommission analytics1031/32 from the Hadoop analytics cluster
[07:39:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:40:09] in theory it should be fine, I'd leave it running
[07:40:51] we can then reassess the decision on Monday
[07:43:46] ok 1031/32 are in decom process
[08:16:08] !log restart eventlogging daemons on eventlog1002 to pick up openssl updates
[08:16:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:54:28] Hi elukey - Indeed the balancer has done some awesome job in compressing the datanode-usage-histogram to its center :)
[08:55:10] elukey: I think it'll still have work to do as we are decommissioning nodes, but it seems working :)
[08:56:45] elukey: an interesting link about tuning spark jobs - https://towardsdatascience.com/how-does-facebook-tune-apache-spark-for-large-scale-workloads-3238ddda0830
[08:57:05] elukey: MOAR KNOBS to play with :)
[08:57:48] \o/
[09:14:20] Hi fdans - let me know when you want to discuss data-quality stuff
[09:18:49] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 (elukey)
[09:27:59] (CR) Joal: [V: +2] Add direct kafka-to-druid ingestion example [analytics/refinery] - https://gerrit.wikimedia.org/r/480956 (https://phabricator.wikimedia.org/T203669) (owner: Joal)
[09:40:36] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (JAllemandou) Job killed from `druid1001.eqiad.wmnet` using: ` # Get supervisor ID curl -L druid1001.eqiad.wmnet:8090/dr...
[09:42:04] elukey: moved T203669 to done as I have killed the job and the supervisor code is merged in our repo
[09:42:05] T203669: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669
[09:42:25] !log Kill banner test kafka-druid ingestion job
[09:42:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:44:39] Analytics, Analytics-Kanban, Analytics-Wikistats: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (JAllemandou)
[09:56:37] Analytics, Analytics-Kanban: [Spike] Spark job for digests-only mediawiki-history-reduced - https://phabricator.wikimedia.org/T212928 (JAllemandou)
[09:56:44] Analytics, Analytics-Kanban: [Spike] Spark job for digests-only mediawiki-history-reduced - https://phabricator.wikimedia.org/T212928 (JAllemandou) a:JAllemandou
[09:57:30] Analytics, Analytics-Kanban, Analytics-Wikistats: Active Editors metric per project family - https://phabricator.wikimedia.org/T188265 (JAllemandou)
[09:57:32] Analytics, Analytics-Kanban: [Spike] Spark job for digests-only mediawiki-history-reduced - https://phabricator.wikimedia.org/T212928 (JAllemandou)
[09:57:35] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Create report for "articles with most contributors" in Wikistats2 - https://phabricator.wikimedia.org/T204965 (JAllemandou)
[10:08:08] (PS1) Joal: Update druid-webrequest jobs adding is_pageview [analytics/refinery] - https://gerrit.wikimedia.org/r/482277 (https://phabricator.wikimedia.org/T212778)
[10:08:26] Analytics, Analytics-Kanban, Patch-For-Review: Add is_pageview as a dimension to the 'webrequest_sampled_128' Druid dataset - https://phabricator.wikimedia.org/T212778 (JAllemandou) a:JAllemandou
[10:27:32] joal: ack! (sorry just seen the ping)
[10:38:34] Analytics, Analytics-Kanban, DBA, Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (Banyek) +1 on removing these tables as mentioned in T210693 too.
[11:06:46] (PS1) Joal: Correct typo in refinery-core [analytics/refinery/source] - https://gerrit.wikimedia.org/r/482284
[11:42:59] * elukey lunch
[12:34:03] joal: so sorry I missed your ping, let's talk about it at some point this afternoon/evening?
[12:34:12] sure fdans
[12:34:19] merciii
[13:26:43] Analytics, Product-Analytics: Metrics request on portal namespace usage - https://phabricator.wikimedia.org/T205681 (AfroThundr3007730) No worries, I figured things would slow down over the holidays. > And your request had the bad fortune of being the first time that this issue in our existing data surfa...
[13:41:44] joal: hola, i was looking at the geoeditors and *i think* this interval should be greater than what it is: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/load/coordinator.xml#L85 cc milimetric
[13:42:07] Analytics, Contributors-Analysis, Product-Analytics, Epic: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (Nuria) @chelsyx yes, it is scooped monthly
[13:44:36] nuria: why?
[13:44:54] nuria: the job has not raised any alarm
[13:46:42] joal: shouldn't we have seen an alarm last month and the month prior though? data was not sqooped by the 5th thus not ready couple days after
[13:47:24] nuria: private sqoop for cu_changes it is, not labs-sqoop - Starts on the 1st of the month, and usually finishes early the 2nd
[13:47:36] joal: ahahahaha
[13:47:41] * nuria FORGOT!
[13:47:57] :)
[13:50:24] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (Nuria) Open→Resolved
[13:51:12] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (Nuria) Closing ticket as it did not seem FR was using this data, data source in turnilo is present but will not be upda...
[13:52:30] (CR) Nuria: [C: +2] Update druid-webrequest jobs adding is_pageview [analytics/refinery] - https://gerrit.wikimedia.org/r/482277 (https://phabricator.wikimedia.org/T212778) (owner: Joal)
[14:06:12] (CR) Nuria: [V: +2 C: +2] "Run tests and they work fine, merging." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/482284 (owner: Joal)
[14:07:47] (CR) Nuria: [V: +2 C: +2] Update druid-webrequest jobs adding is_pageview [analytics/refinery] - https://gerrit.wikimedia.org/r/482277 (https://phabricator.wikimedia.org/T212778) (owner: Joal)
[14:12:33] Analytics, Analytics-Kanban, DBA, Data-Services, and 3 others: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 (Banyek) On labsdb1010 this would be the quickest (with depooled host) `#!/bin/bash MYSQL="sudo mysql --skip-ssl...
[15:04:21] Analytics, Analytics-EventLogging, Operations, ops-eqiad: db1107 has CRITICAL status in power supply - https://phabricator.wikimedia.org/T212910 (Marostegui)
[15:09:35] (PS16) Mforns: Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099)
[15:10:21] (CR) Mforns: "Do you think we should add a unit test file?" (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: Mforns)
[15:11:44] Analytics, Analytics-EventLogging, Operations, ops-eqiad: db1107 has CRITICAL status in power supply - https://phabricator.wikimedia.org/T212910 (Marostegui) p:Triage→Normal This happened around the same time as {T212909} maybe there was some work being done over those racks and the cable...
[15:13:00] (CR) jerkins-bot: [V: -1] Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099) (owner: Mforns)
[15:21:17] Analytics, Operations, ops-eqiad: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (Banyek)
[15:21:21] Analytics, Analytics-EventLogging, Operations, ops-eqiad: db1107 has CRITICAL status in power supply - https://phabricator.wikimedia.org/T212910 (Banyek)
[15:21:23] Analytics, Operations, ops-eqiad: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (Banyek)
[15:22:20] Analytics, Operations, ops-eqiad: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (Marostegui)
[15:31:14] mforns: o/
[15:31:22] do you want to bc before/after standup?
[15:33:18] elukey: question if you may
[15:33:26] of course
[15:33:57] elukey: i have started another turnilo on analytics-tool1002.eqiad.wmnet to test some changes on config
[15:34:03] elukey: on port 9099
[15:34:21] elukey: but i cannot connect to it via ssh -N analytics-tool1002.eqiad.wmnet -L 9099:analytics-tool1002.eqiad.wmnet:9099
[15:35:11] smells like firewall rule, checking
[15:39:25] elukey: ok
[15:39:51] so the same rule with port 9100 doesn't work
[15:40:01] (that should be the port that turnilo currently uses)
[15:40:02] elukey: i have done this before to test turnilo changes so i .. ahem *think* it worked np
[15:40:37] ah no snap 9091
[15:40:38] my bad
[15:41:06] but same result
[15:41:38] I believe that we explicitly allow only http for that host
[15:42:58] elukey: could we change that?
[15:43:10] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (herron) Also... ` an-coord1001:~$ uptime 15:40:26 up 92 days, 1:05, 2 users, load average: 1177.70, 1166.05, 1131.88 ` Looks like loads of icinga check_disk processes in D sta...
[15:43:15] elukey: so as to be able to test changes
[15:43:22] elukey: in turnilo's config
[15:43:55] elukey: let me know if you can think of a better way.
[15:47:05] checking atm the ssh tunnel, I am wondering if Friday is tricking my brain
[15:48:49] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (herron) Also looks like /mnt/hdfs is hanging on this host, which would explain check_disk stacking up
[15:56:30] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (elukey) Thanks a lot for the task, I didn't see this today :( So two things: 1) the disk fills up due to logs, sadly there is a chatty systemd timer (hdfs-balancer) that emits logs...
[16:00:51] sorry I got worried by --^ :)
[16:06:22] nuria: the best way is of course not to start anything on a production host manually and test it in labs
[16:07:01] or we could have a testing turnilo instance with related apache vhost
[16:07:13] that we can easily change, restart, etc.. on the same node
[16:07:43] but even that might be a problem since we allow http connections only from the caching nodes
[16:10:30] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (herron) Thanks @elukey! Is INFO level logging from hdfs-balancer needed? If not we might also be able to turn down the verbosity there. Definitely open to optimizing the logging c...
[16:12:32] (PS17) Mforns: Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099)
[16:13:02] elukey, yes we can finish talking about consumer groups
[16:14:34] Analytics, Analytics-Cluster, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (elukey) Completely ignorant about autofs but it looks a very viable option, I am all for trying it :) The verbosity could be lowered indeed, not sure if possible judging from the da...
[16:14:58] mforns: gimme 5 and I am in bc
[16:15:03] elukey, sure
[16:16:46] Analytics, Analytics-Kanban, Operations: Allow the deployment of users without SSH access - https://phabricator.wikimedia.org/T212949 (elukey) p:Triage→Normal
[16:52:03] (CR) Milimetric: Update mediawiki-history comment and actor joins (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: Joal)
[16:55:05] Analytics, Core Platform Team Backlog (Watching / External), Services (watching): Evaluate using TypeScript on node projects - https://phabricator.wikimedia.org/T206268 (Milimetric) What I was implying by the headaches comment was that TypeScript just adds another step to the build. So the build sys...
[16:59:25] elukey: ok, let's talk about doing this in labs as a PS
[17:00:30] Analytics, Analytics-Kanban, Operations: an-coord1001 almost out of disk - https://phabricator.wikimedia.org/T212915 (elukey) p:Triage→High
[17:18:34] Analytics, Product-Analytics: Metrics request on portal namespace usage - https://phabricator.wikimedia.org/T205681 (AfroThundr3007730)
[17:33:20] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (AndyRussG) Dear @Nuria, @JAllemandou, @elukey, @mforns, Thank you so much for all your work on this. It is hugely appr...
[17:36:29] (CR) Joal: "Replies inline :)" (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: Joal)
[17:47:05] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (Nuria) @AndyRussG I think it will be worth to open a new ticket explaining what data you need and what it is used for....
[17:50:16] Analytics, Analytics-Kanban: Clean up staging db - https://phabricator.wikimedia.org/T212493 (Marostegui) Sure, fine by me!
[17:54:31] * elukey off!
[17:56:10] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (DStrine) I think @Jseddon would still like this but the holiday work and time off have been a factor. I know a bit of h...
[18:20:19] Analytics, Analytics-Kanban: Create staging domain for turnilo to test config changes - https://phabricator.wikimedia.org/T212958 (Nuria) p:Triage→Normal
[18:28:57] Analytics, Analytics-Kanban: unique devices monthly should be configured with default "monthly" granularity in turnilo - https://phabricator.wikimedia.org/T209103 (Nuria)
[22:15:08] (PS18) Mforns: Allow for custom transforms in DataFrameToDruid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/477295 (https://phabricator.wikimedia.org/T210099)
[22:43:07] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (AndyRussG) >>! In T203669#4855666, @Nuria wrote: > Either way, it will be helpful to open a ticket that explains in det...