[00:13:13] ebernhardson: thanks [00:14:05] ebernhardson: which schemas already exist now? who has experience with identifying the best points for adding the schemas at a particular wiki? [00:27:50] 10Analytics, 10EventBus, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Doing), and 2 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (10mobrova... [01:00:18] Sveta: schemas: https://meta.wikimedia.org/w/index.php?search=prefix%3Aschema%3A&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns470=1 [01:01:18] Sveta: various code that logs to schemas: https://codesearch.wmflabs.org/search/?q=mw.track.*event%5C.&i=nope&files=&repos= [01:58:41] thank you, ebernhardson [04:07:29] 10Analytics, 10Analytics-Kanban: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10Nuria) @JAllemandou the title of this graph looks like it needs changing? [05:17:34] 10Analytics-Kanban, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10kzimmerman) [05:18:25] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10kzimmerman) [05:49:04] Wow I'm very sorry elukey - I should have trusted your paranoid view yesterday :( [06:12:49] 10Analytics, 10Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (10elukey) Forgot to update: we have re-enabled two of the cron jobs that were disabled on March 26th, that hopefully should restore the pagecount-ez. [06:13:40] joal: bonjour! 
[06:13:44] 10Analytics, 10Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (10elukey) Judging from https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-03/ it doesn't seem the case, checking further.. [06:14:02] nah you were right, it was only bad luck [06:36:02] 10Analytics, 10Analytics-Kanban, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) >>! In T219842#5083320, @Nuria wrote: >... [07:09:46] (03PS1) 10QChris: Add .gitreview [analytics/wmde/WD/WD_identifierLandscape] - 10https://gerrit.wikimedia.org/r/501152 [07:09:49] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [analytics/wmde/WD/WD_identifierLandscape] - 10https://gerrit.wikimedia.org/r/501152 (owner: 10QChris) [07:29:11] 10Analytics, 10Analytics-Kanban: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10JAllemandou) >>! In T219910#5083908, @Nuria wrote: > @JAllemandou the title of this graph looks like it needs changing? > > https://grafana.wikimedia.org/d/000000538/d... [07:30:44] joal: https://grafana.wikimedia.org/d/000000027/kafka?refresh=5m&orgId=1&from=now-24h&to=now [07:30:49] looks better :) [07:34:35] 10Analytics, 10Analytics-Kanban, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10elukey) @EBernhardson tried yesterday a modified... [07:34:38] Plenty thanks for that elukey - unstacking for the win!!! 
[07:43:04] ah joal if you have time today, I'd ask you to test in deployment-aqs01/2/3 if aqs is behaving [07:43:09] correctly [07:43:17] Moritz deployed in there a new version of nodejs [07:43:39] (IIRC you have the ops remaining-week, if not lemme know) [07:44:34] elukey: ack! Starting now [07:45:09] thanks! [07:46:32] thx [08:01:05] elukey: quick question for you: IIRC we don't have a druid endpoint for beta, right? [08:03:07] elukey: it seems so [08:03:46] elukey, moritzm: I have manually requested data from AQS - It has worked as expected (data when available, error when no data etc) - Do you expect me to test something else? [08:05:27] if you think that covers typical usage, that's good enough for me, thanks. I'll upload the new nodejs packages to apt.wikimedia.org, then [08:06:00] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) @EBernhardson on stat1005 `mivisionx` was causing a broken apt... [08:06:17] we can then test on aqs1004 [08:06:26] ack [08:06:43] moritzm: typical use-case has been tested except for the druid one, but I think we're safe (we'll monitor the restart and test with elukey) [08:06:52] Thanks elukey :) [08:07:00] super thanks joal [08:43:36] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (10fdans) Acknowledged @Tbayer, the capsule fields are now loaded too in the test period. 
[08:58:50] 10Analytics, 10Operations, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10fgiunchedi) [08:59:36] team - I was feeling not so good this morning and I actually have strong fever - I'll help with ops-related stuff (AQS restart and mediawiki-history follow up) but will not try to produce anything [08:59:51] man - this week is awful :( [09:03:43] The only thing I'd like to see moving is a decision on whether to use a timestamp or null for user/page lineage first event based on real create events [09:11:23] joal: rest! [09:11:38] elukey: aqs? [09:11:41] aqs is fine, I'll do it later on [09:11:46] no hurry :) [09:11:47] base? [09:11:51] API? [09:12:04] ?? [09:12:08] sorry ... /me has fever and makes poor jokes [09:13:03] elukey: only thing needed to be tested IMO when changing on aqs1004 for real is an example of druid-querying endpoint [09:14:06] elukey: rest - base ... https://www.youtube.com/watch?v=q12oOIYHVDA [09:15:06] joal: sure but it can be done later on by Dan and me (or anybody else) so please rest :) [09:43:57] (03CR) 10Joal: "comment inline - Please try everything you want Dan :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/497604 (https://phabricator.wikimedia.org/T218463) (owner: 10Joal) [09:45:06] still in draft but https://etherpad.wikimedia.org/p/analytics-cdh5.16.1 looks better after using only cumin and cookbooks [09:48:54] (03CR) 10Addshore: WIP: count number of Wikidata edits by namespace (031 comment) [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: 10Lucas Werkmeister (WMDE)) [09:53:07] elukey: pagecounts-ez job succeeded [09:53:44] elukey: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-04/ [09:53:48] ahh lovely I can see the giles [09:53:49] \o/ :) [09:53:50] *files [09:53:51] good :) [10:01:02] (03CR) 10Addshore: "This must be cherry picked onto the 
production branch in order to get deployed." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [10:29:23] heya teammm [10:30:02] tizianop: o/ [10:30:35] hi! [10:32:28] I am currently re-enabling your account [10:32:30] running puppet now [10:32:40] before re-running your spark job, let's sync with joal [10:33:18] ok, thanks! [10:38:21] Heya - Let's debug that script [10:38:35] tizianop: Can you try to explain briefly what the script does? [10:39:40] 10Analytics, 10Analytics-Kanban: Metric should not rotate if they are not available for the selected wiki - https://phabricator.wikimedia.org/T220083 (10fdans) [10:39:48] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Create report for "articles with most contributors" in Wikistats2 - https://phabricator.wikimedia.org/T204965 (10JAllemandou) Reading about this - Would delayed data be interesting? This information is accessible in hadoop :) [10:40:32] Hey joal, I think the problem is quite trivial. I just wanted to rearrange the rows of the dataset to have in the same parquet all the events of 1 user (partition by session_token) [10:41:30] joal: I'm not sure what you mean by "delayed data" in the top by number of editors metric ticket [10:41:31] tizianop: You're talking about webrequest-based data I assume, but we don't know how much [10:42:17] fdans: I commented on the wrong ticket o.O - sorry about that [10:42:24] fdans: deleting comment now [10:42:38] joal: thank youuuu! [10:42:41] :S [10:43:18] things to remember: paracetamol helps to feel better, but doesn't make the brain work [10:44:08] joal: https://meta.wikimedia.org/wiki/Schema:CitationUsage --> clean/anonymized subset of this [10:45:12] tizianop: I'm trying to get an idea of the data size you're playing with - CitationUsage schema on event DB I guess, timebounds? 
[10:47:02] tizianop: /wmf/data/event/CitationUsage/year=2019 contains 9.3GB of data - This shouldn't be an issue [10:47:43] tizianop: it means something not expected (probably at join-time) is making your spark job do unacceptable stuff :) [10:48:04] tizianop: account restored. Nit: your home on notebook1003 was backed up and removed, so I restored it. There is a file called 'piccardi.tar.bz2' in your home, please delete it if your home looks good :) [10:48:11] tizianop: By unacceptable, I mean physically unacceptable (too many files :) [10:50:05] joal: that table contained the raw data and it was completely deleted. I have to work with an anonymized dataset in my home folder (200GB) [10:50:11] 10Analytics, 10Analytics-Wikistats: First access to the detail page causes glitchy loading - https://phabricator.wikimedia.org/T220088 (10fdans) [10:51:55] joal: I wanted to reorganise the parquet files to have in the same node the data of the same user and reduce the shuffling (I often do groupBy("session_id")) [10:52:48] joal: but I didn't specify the number of partitions and I guess Spark created a parquet file for each session token :( [10:53:37] tizianop: partitioning in spark can be cumbersome - do you use RDDs or dataframes ? I assume you use dataframes (parquet) [10:53:57] if so, the default number of partitions to be used by spark is 200 [10:54:02] when doing computation [10:54:57] Spark will have as many partitions as there are HDFS blocks in the original dataset, and once shuffle is needed, use 200 if not explicitly set [10:55:03] joal: dataframes. 
I did this: pageview_events.repartition("session_id").write.partitionBy("session_id").parquet("pageview_events_by_user.parquet") [10:55:28] joal: hoping to keep the number of partitions at 200, but it exploded :/ [10:55:36] The problem in the above is `.partitionBy("session_id")` [10:55:47] The rest is ok [10:56:01] `.partitionBy("session_id")` makes spark try to write 1 folder for every session [10:56:51] yes, I should have used something like partitionBy("session_id", 200). I hope it's possible [10:56:58] And given that you use `.repartition("session_id")`, you should actually end up with as many partitions as you had in the original dataset [10:57:04] tizianop: no [10:57:21] tizianop: you don't need to write multiple folders [10:58:03] by repartitioning based on session_id, spark will guarantee you that partitions contain all events for a given subset of sessions [10:58:09] Writing as is will help [10:58:25] Spark will still try to shuffle, but the work will be highly reduced [10:58:58] Or, you can also take advantage of the pre-partitioned data, and work the data manually to take advantage of the existing partitioning [10:59:40] tizianop: Do I make sense or not really? [11:01:35] joal, yes absolutely! [11:02:34] I think for now I can use the dataset in the current format [11:03:39] joal: I know you're super busy, but if you have a moment can we talk for a minute in the batcave? I've found something weird [11:03:59] tizianop: I'm super happy for you to try to repartition, but not partitionBy :) [11:04:02] fdans: OMW [11:04:47] fdans: I'm not busy, I'm slow :) [11:09:45] joal: thank you very much! I'll stay as far as possible from partitionBy :) [11:10:17] tizianop: please use it when needed, but keep in mind what it does ;) [11:10:29] perfect! 
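The repartition/partitionBy distinction joal explains above can be modelled without a cluster. This is a hypothetical pure-Python sketch, not actual Spark code: `repartition("session_id")` hashes each key into a fixed number of shuffle partitions (`spark.sql.shuffle.partitions`, 200 by default), while `write.partitionBy("session_id")` creates one output directory per distinct key value, which is what exploded here.

```python
# Hypothetical model of the two Spark operations -- NOT Spark itself.
# repartition(key): rows are hash-partitioned into a bounded number of
# partitions. write.partitionBy(key): one output directory per distinct key.

SHUFFLE_PARTITIONS = 200  # Spark's default spark.sql.shuffle.partitions

def repartition_partition_count(session_ids):
    """Partitions produced by hash-partitioning on the key (bounded)."""
    return len({hash(s) % SHUFFLE_PARTITIONS for s in session_ids})

def partition_by_dir_count(session_ids):
    """Output directories produced by partitionBy on the key (unbounded)."""
    return len(set(session_ids))

sessions = [f"session-{i}" for i in range(100_000)]
print(repartition_partition_count(sessions))  # bounded: never more than 200
print(partition_by_dir_count(sessions))       # 100000 -- one dir per session
```

With 100,000 distinct sessions, repartitioning stays within the shuffle-partition bound while partitionBy would attempt 100,000 directories, matching the "one folder for every session" failure mode described in the log.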
[11:22:42] joal: very interesting [11:22:43] elukey@kafka-jumbo1003:~$ cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [6789]' [11:22:46] 07:52:12 89786.25 [11:22:48] 08:21:45 683320.2 [11:22:51] 08:21:46 651289.5 [11:23:04] so --^ are values, in Kbps, for the tx bandwidth of jumbo 1003 [11:23:18] that are above 600Mbps [11:23:28] ah no not the first one :D [11:23:35] I was about to say :) [11:23:37] ok [11:23:45] anyway, even close to ~700Mbps is a lot [11:23:55] any spike can easily saturate [11:24:29] elukey: are those values regularly above, or was there something special at 08:21? [11:25:28] regular is lower, not sure about 8:21 but probably some consumers pulling data [11:25:34] the correct grep is cat ifstat.log | awk '{print $1" "$3}' | egrep '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\ [6789][0-9]{5}' :) [11:25:43] uhuu [11:26:55] I'll let it run for some days to see if I can get more proof [11:27:04] ifstat is currently in tmux on all jumbos [11:27:04] k [11:27:21] this one is candidate for moriel --^ [11:27:34] :) [11:27:46] ahahhah [11:52:33] 10Analytics, 10Analytics-Kanban: Set up edit_hourly data set in Hive - https://phabricator.wikimedia.org/T220092 (10mforns) [11:56:16] * elukey lunch! 
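The awk/egrep pipeline above (keep the timestamp and TX column, filter for samples at or above 600 Mbps) can be expressed as a short script. This is a hedged sketch assuming log lines of the form `HH:MM:SS rx_kbps tx_kbps`, matching the columns elukey's awk pulls out; the actual ifstat column layout may differ.

```python
# Hypothetical re-implementation of the ifstat filter from the log:
# keep (timestamp, tx_kbps) samples whose TX rate is >= 600 Mbps.
# Assumed line format: "HH:MM:SS  rx_kbps  tx_kbps" (field 3 = TX).

THRESHOLD_KBPS = 600_000  # 600 Mbps expressed in Kbps

def high_tx_samples(lines, threshold=THRESHOLD_KBPS):
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            tx = float(fields[2])
        except ValueError:
            continue  # skip header lines and other non-numeric rows
        if tx >= threshold:
            out.append((fields[0], tx))
    return out

sample = [
    "07:52:12 12345.67 89786.25",
    "08:21:45 23456.78 683320.2",
    "08:21:46 34567.89 651289.5",
]
print(high_tx_samples(sample))
# [('08:21:45', 683320.2), ('08:21:46', 651289.5)]
```

Unlike the first (too-loose) egrep in the log, a numeric comparison cannot accidentally match 89786.25, which is why the 07:52:12 sample is excluded here.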
[12:02:35] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:03:41] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:05:55] Ah - I think this is druid related --^ indexation job for new snapshot just finished, leading to a lot of data being transferred to the druid historical nodes [12:15:32] hm - We have an issue with druid-public datapurge :( [12:23:14] (03PS1) 10Mforns: Add edit_hourly oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) [12:26:16] (03PS2) 10Mforns: Add edit_hourly oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) [12:28:13] (03PS3) 10Mforns: Add edit_hourly oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) [12:33:41] 10Analytics: Deal with truncated values in uniques - https://phabricator.wikimedia.org/T220098 (10fdans) [12:42:32] elukey: I have fun stuff for you when you're b [12:42:34] back [12:55:27] elukey: ok, I think I have the beginning of an answer, continuing to search :) [12:55:43] hi. 
there has been an Icinga alert about the status of "systemd unit refinery-sqoop-mediawiki" on an-coord1001 for like 2 days [12:56:32] Hi mutante - Arf - The job has failed but has been fixed manually - I'll ask elukey how to proceed to silence this alarm - Apologies for the noise :( [12:58:17] i can schedule a downtime, just wondering for how long [12:58:46] mutante: I actually don't know enough about timers to tell you - should it be until next run? [13:01:15] here I am :) [13:01:25] Hi elukey - Sorry for the pings :S [13:01:28] mutante: I think that a simple systemctl reset-failed is enough, going to fix it [13:02:09] elukey: ah! well that is even easier. cool. i just scheduled a downtime for 2 weeks for both the specific check and the general one about systemd broken [13:02:33] mutante: ah ok, going to remove it then :) [13:02:40] ok, thanks! :) [13:03:24] Thanks mutante :) [13:04:07] joal: what is the fun stuff? [13:04:28] elukey: the purge of public snapshot fails silently [13:04:46] elukey: on an-coord1001, less /var/log/refinery/drop-druid-public-snapshots.log.1 [13:05:16] elukey: even funnier, look precisely at log-lines timestamps [13:05:42] elukey: HTTP calls take 1 to 2 minutes to fail [13:05:53] elukey: I think I have pinpointed the issue: IPv6 [13:08:13] so likely the vlan firewall rules [13:08:27] very possible [13:08:57] have you tried to telnet? 
[13:09:53] yup - `telnet druid1004.eqiad.wmnet 8081` hangs, while `telnet -4 druid1004.eqiad.wmnet 8081` succeeds immediately [13:10:12] elukey: --^ [13:10:47] RECOVERY - Check the last execution of refinery-sqoop-mediawiki on an-coord1001 is OK: OK: Status of the systemd unit refinery-sqoop-mediawiki [13:16:23] it does have ferm / iptables for IPv6 and that port [13:16:24] 0 0 ACCEPT tcp * * 2620:0:861:105::/64 ::/0 tcp dpt:8081 [13:16:33] i think it is router ACL then [13:17:06] yep just confirmed [13:17:06] (an-coord1001 is in 2620:0:861:105) [13:17:20] druid100[4-6] are not in the analytics vlan [13:17:28] so I need to modify the ipv6 filter to allow the traffic [13:17:43] joal: the drop script needs to return 1 when it fails though [13:17:52] elukey: indeed !!!! [13:19:42] elukey: I also need to solve the 307 issue [13:20:01] elukey: please let me know when the router is ok, so that I can test without waiting ages please :) [13:21:17] yep, will take me ~10m [13:21:29] np elukey - no rush [13:32:40] elukey@an-coord1001:~$ telnet druid1004.eqiad.wmnet 8081 [13:32:40] Trying 2620:0:861:101:10:64:0:35... [13:32:40] Connected to druid1004.eqiad.wmnet. 
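The `telnet` vs `telnet -4` check above (default lookup hangs on the firewalled IPv6 path, forcing IPv4 succeeds) can be scripted as a per-family TCP probe with a short timeout, so a silently dropped path shows up quickly as a failure instead of a long hang. This is a hedged sketch; the druid host/port in the trailing comment come from the log, everything else is illustrative.

```python
# Hedged sketch of a per-address-family connectivity probe, mirroring
# the telnet / telnet -4 comparison from the log.
import socket

def can_connect(host, port, family=socket.AF_UNSPEC, timeout=3.0):
    """Return True if a TCP connect succeeds within `timeout` seconds."""
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # name does not resolve for this family
    for af, socktype, proto, _canon, addr in infos:
        try:
            with socket.socket(af, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return True
        except OSError:
            continue  # refused, timed out, or unreachable: try next address
    return False

# Comparing the two families points at the broken path, e.g.:
#   can_connect("druid1004.eqiad.wmnet", 8081, socket.AF_INET)   # v4 only
#   can_connect("druid1004.eqiad.wmnet", 8081, socket.AF_INET6)  # v6 only
```

If the IPv4 probe returns True while the IPv6 probe times out, that is the same evidence the telnet pair gave here: a firewall or router ACL dropping only the v6 traffic.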
[13:32:43] good :) [13:36:45] joal: ale [13:37:03] should work now :) [13:37:08] testing [13:37:26] confirmed with telnet elukey :) [13:37:42] Many thanks for that [13:37:46] super [13:59:52] 10Analytics, 10Analytics-Kanban: Fix druid-public drop-snapshot script - https://phabricator.wikimedia.org/T220111 (10JAllemandou) [14:01:51] yeah I was checking the code, we don't propagate any exception [14:02:00] I think we should simply remove the excepts [14:02:07] and let python do the work [14:02:13] (in case of deletes, I mean) [14:06:37] 10Analytics, 10Analytics-Kanban: Fix druid-public drop-snapshot script - https://phabricator.wikimedia.org/T220111 (10elukey) self.delete in utils.py should simply, in my opinion, return http exceptions if any so we can get them in the drop script and sys.exit(1) [14:07:42] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Set up edit_hourly data set in Hive - https://phabricator.wikimedia.org/T220092 (10mforns) [14:13:58] (03PS1) 10Hoo man: Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501327 (https://phabricator.wikimedia.org/T216835) [14:14:20] (03CR) 10jerkins-bot: [V: 04-1] Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501327 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [14:16:43] (03PS1) 10Mforns: Add edit_hourly to list of tables to be purged of old snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501328 (https://phabricator.wikimedia.org/T220092) [14:17:00] (03PS2) 10Hoo man: Track number of Schemas [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501327 (https://phabricator.wikimedia.org/T216835) [14:17:03] (03PS1) 10Hoo man: Fix PHP CodeSniffer [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501329 [14:19:14] (03CR) 10Hoo man: "> This must be cherry picked onto the production branch in order to get deployed." 
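The drop-snapshot fix discussed above (remove the swallowing excepts, let the HTTP delete raise, and exit nonzero on failure) can be sketched as follows. This is a hypothetical illustration of the pattern, not the actual refinery utils.py code; function names and the URL scheme are made up.

```python
# Hypothetical sketch of the "fail loudly" fix for a purge script:
# the delete helper raises on HTTP failure, and the driver turns any
# failure into a nonzero exit code so cron/Icinga can catch it.
import sys
import urllib.request
from urllib.error import URLError

def delete(url, timeout=30):
    """Issue an HTTP DELETE; raises URLError (or HTTPError) on failure."""
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status

def drop_snapshots(urls, delete_fn=delete):
    """Delete every snapshot URL; return a shell-style exit code."""
    for url in urls:
        try:
            delete_fn(url)
        except URLError as err:
            # Propagate the failure to the caller instead of logging and
            # continuing: a wrapper can then alert on a nonzero exit code.
            print(f"failed to drop {url}: {err}", file=sys.stderr)
            return 1
    return 0

# A real script would end with: sys.exit(drop_snapshots(snapshot_urls))
```

Because `HTTPError` subclasses `URLError`, both connection-level failures (like the IPv6 hang seen here) and HTTP error statuses bubble up to the same handler and produce exit code 1 instead of a silent success.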
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/500717 (https://phabricator.wikimedia.org/T216835) (owner: 10Hoo man) [14:55:52] 10Analytics, 10Analytics-Kanban, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10Nuria) @elukey Nice, so the unbalanced partitios... [15:14:56] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (10fgiunchedi) [15:31:55] are people going to the medium term planning meeting during standup/grosking? [15:31:58] a-team: ^ [15:32:15] I already followed the early one [15:32:24] (are we supposed to go to both?) [15:32:33] me too [15:32:38] no no both [15:32:40] ah okok [15:32:50] I went to the first one for 1/2h [15:45:46] I think we all should go to both meetings [15:53:30] I’m going to both, but because that’s the kind of thing I would do, not because I think everyone should. [15:55:39] hare: can't get enough of that sweet sweet medium term planning! [15:56:08] Implementing this thing is going to be my job for the next few years, I really want to know this stuff well! 
[15:56:27] nice [16:10:05] 10Analytics, 10Analytics-Kanban: Metric should not rotate if they are not available for the selected wiki - https://phabricator.wikimedia.org/T220083 (10Milimetric) ah yea, I did this as part of filtering out all the metrics, and forgot to look after we reverted the desktop filtering [16:11:25] 10Analytics, 10Analytics-Wikistats: Add an option to export the current graph into image file - https://phabricator.wikimedia.org/T219969 (10Milimetric) [16:13:05] 10Analytics: Research wether we can throttle the number of files created by a job so namenode does not get overwhelmed - https://phabricator.wikimedia.org/T220126 (10Nuria) [16:23:35] Need help with the EventLogging beta server if anyone's around. Got events which are being received but not validated. They look fine to me but something must be off; unfortunately error logging (/var/log/eventlogging/eventlogging-processor@client-side-0*.log) seems to be broken right now so we can't see what the issue is. Last messages in those two files are related to Kafka. [16:29:20] bearloga: logs should be under srv IIRC [16:29:35] /srv/log/eventlogging/systemd [16:30:12] there are a ton of errors in the processors, I think it might be what you are looking for [16:31:25] elukey: thanks, I'll check it out! [16:33:12] np! [16:33:19] lemme know if anything doesn't work [16:48:04] 10Analytics: Research wether we can throttle the number of files created by a job so namenode does not get overwhelmed - https://phabricator.wikimedia.org/T220126 (10fdans) p:05Triage→03High [16:51:58] 10Analytics, 10Analytics-Kanban: Fix druid-public drop-snapshot script - https://phabricator.wikimedia.org/T220111 (10fdans) p:05Triage→03High [16:53:38] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. 
https://wikitech.wikimedia.org/wiki/EventBus [16:53:45] 10Analytics, 10Analytics-Wikistats: Add an option to export the current graph into image file - https://phabricator.wikimedia.org/T219969 (10fdans) p:05Triage→03Normal [16:54:29] 10Analytics: Deal with truncated values in uniques - https://phabricator.wikimedia.org/T220098 (10fdans) p:05Triage→03Normal [16:54:40] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) With the changes in packages now trying to run any model... [16:54:56] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [16:55:47] ebernhardson: sigh [16:56:02] IIUC we can have either miopen-hip [16:56:06] joal, elukey: FYI I'm running a script to do what I discussed before (https://pastebin.com/qdJWtWLD). In this case, I add a field (max 200 values) and I partition on that. I already tried on a small sample and it works fine. Let me know if it creates any problem [16:56:09] or miopen-opencl [16:56:20] tizianop: sure :) [16:56:28] thanks for the heads up [16:58:29] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) A solution could be to remove `mivisionx` (not sure if needed)... 
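The workaround tizianop describes above (add a derived field with at most 200 values and partition on that instead of the raw session token) is a classic bounded-bucket scheme. A hedged sketch, with the column name and hashing choice illustrative rather than taken from the linked pastebin:

```python
# Hedged sketch of bucketing sessions into a bounded partition key:
# hash session_id into one of N_BUCKETS values, so partitioning on the
# bucket writes at most N_BUCKETS groups while still keeping all of a
# session's events together.
import zlib

N_BUCKETS = 200

def session_bucket(session_id: str) -> int:
    """Stable bucket in [0, N_BUCKETS); same session always maps together."""
    # crc32 is deterministic across runs and processes, unlike Python's
    # salted str hash, so the bucketing is reproducible between jobs.
    return zlib.crc32(session_id.encode("utf-8")) % N_BUCKETS

distinct_buckets = {session_bucket(f"session-{i}") for i in range(10_000)}
print(len(distinct_buckets))  # bounded by N_BUCKETS, i.e. at most 200
```

A deterministic hash matters here: if the same session could land in different buckets on different runs, a later groupBy("session_id") would again have to shuffle across all partitions, defeating the point of the layout.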
[16:59:00] elukey: isn't packaging fun :) I can play whack-a-mole trying package combos later today [16:59:13] ahhaha yes yes [16:59:32] we can also try to push upstream to solve the issue [17:00:20] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Set up edit_hourly data set in Hive - https://phabricator.wikimedia.org/T220092 (10fdans) p:05Triage→03High [17:00:24] elukey: i dunno how tenable it would be, but they also provide docker images that are supposed to have all the appropriate packages installed and "just works" [17:00:34] and that would avoid the installing ubuntu packages to debian issue [17:01:14] ebernhardson: IIUC this problem is re-using the same path for header files noe? [17:01:17] *no ? [17:01:26] I mean, it should happen even on docker [17:01:52] 10Analytics, 10Analytics-Wikistats: First access to the detail page causes glitchy loading - https://phabricator.wikimedia.org/T220088 (10fdans) p:05Triage→03High [17:02:36] it seems that upstream suggests to either use HIP or OpenCL [17:02:45] they don't seem to contemplate both at the same time [17:03:02] elukey: well, on docker we don't install anything, they have pre-installed everything so assuming it would have the correct things in the right places [17:03:11] 10Analytics, 10Analytics-Kanban: Metric should not rotate if they are not available for the selected wiki - https://phabricator.wikimedia.org/T220083 (10fdans) p:05Triage→03High [17:03:44] ebernhardson: sure, but if you install rocm-dev (or whatever it is called, I don't remember :) everything works, the issue was a dependency of mivisionx [17:03:50] 10Analytics, 10Operations, 10netops: Allow swift https access from analytics to prod - https://phabricator.wikimedia.org/T220081 (10fdans) p:05Triage→03Normal [17:03:53] no? 
[17:04:20] what I am saying is that maybe even on docker everything explodes with rocm stuff + mivisionx [17:04:25] 10Analytics, 10Product-Analytics, 10Growth-Team (Current Sprint), 10Patch-For-Review: Add HelpPanel schema to the EventLogging whitelist - https://phabricator.wikimedia.org/T220033 (10fdans) p:05Triage→03Normal [17:04:54] 10Analytics, 10Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (10fdans) 05Open→03Resolved [17:04:54] elukey: not sure :) ldd /opt/rocm/lib/libMIOpen.so.1 reports it needs libOpenCL.so.1, but running tensorflow when it loads the library complains about ImportError: /opt/rocm/lib/libMIOpen.so.1: version `MIOPEN_HIP_1' not found [17:05:00] 10Analytics, 10Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (10fdans) [17:05:05] elukey: that suggests to me that libMIOpen is expecting to find opencl and hip [17:05:19] 10Analytics, 10Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (10fdans) p:05Triage→03High [17:05:56] i might need to dig more into this to see what's going on, right now just looking at error messages without a ton of context [17:06:21] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (10fdans) p:05Triage→03High [17:06:28] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Incident: Investigate if kafka can decline requests to consume from consumers that support an older protocol - https://phabricator.wikimedia.org/T219936 (10fdans) p:05High→03Normal [17:06:59] ebernhardson: lemme try one thing [17:07:50] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10fdans) p:05Triage→03Normal [17:08:01] mivisionx might not be 
required [17:08:16] so i tried to install miopen-hip [17:08:17] dpkg: error processing archive /var/cache/apt/archives/miopen-hip_1.7.1_amd64.deb (--unpack): [17:08:20] trying to overwrite '/opt/rocm/miopen/include/miopen/config.h', which is also in package miopen-opencl 1.7.1 [17:08:29] 10Analytics, 10Operations, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (10fdans) p:05Triage→03Normal [17:08:36] this is independent from ubuntu docker etc.. [17:10:11] elukey@stat1005:~$ dpkg -S /opt/rocm/miopen/include/miopen/config.h [17:10:12] miopen-opencl: /opt/rocm/miopen/include/miopen/config.h [17:10:15] 10Analytics: Outdated project codes in pagecounts-ez - https://phabricator.wikimedia.org/T219914 (10fdans) p:05Triage→03Normal [17:11:23] elukey: it looks like libMIOpen.so.1 comes from miopen-opencl, i guess we could try removing that and installing miopen-hip instead? not sure which is preferred [17:12:29] ebernhardson: it is mivisionx that requires miopen-opencl, so in theory removing it will allow us to use -hip [17:12:41] elukey: found "that means no, miopen-opencl functionality is not supported within TF." https://github.com/RadeonOpenCompute/ROCm/issues/703#issuecomment-462598966 [17:12:46] elukey: suggests we only want -hip version? [17:12:51] ah! [17:12:54] lol [17:12:55] yes [17:13:06] shall we remove mivisionx ? [17:13:10] yea [17:13:13] doing [17:14:03] ebernhardson: should be done [17:14:44] elukey: mnist.py started at least :) will take a moment to run [17:14:54] it's training, looks like that is the combo of packages we need [17:15:31] gooood [17:17:07] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10fdans) @MusikAnimal is this report coming from users getting 429? Are you getting the errors yourself? 
[17:17:18] 10Analytics, 10Pageviews-API, 10Tool-Pageviews: 429 Too Many Requests hit despite throttling to 100 req/sec - https://phabricator.wikimedia.org/T219857 (10fdans) p:05Triage→03Normal [17:17:54] 10Analytics, 10Reading Depth, 10Readers-Web-Backlog (Tracking): [Bug] Many ReadingDepth validation errors logged - https://phabricator.wikimedia.org/T216063 (10fdans) p:05Triage→03Normal [17:18:09] 10Analytics: Refine eventlogging pipeline should not refine data for domains that are not wikimedia's - https://phabricator.wikimedia.org/T219828 (10fdans) p:05Triage→03High [17:34:05] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [17:36:37] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [17:39:10] our DCs must be cold, gpu reports using its full power availability (160-170W out of 170W allowed), but only 50C and 40-50% fan speed [17:39:23] or at least the cases are well ventilated :) [17:40:02] for comparison the gpu in my desktop machine upstairs often reports 80C and 80%+ fan when doing tensorflow things [17:47:03] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) per https://github.com/RadeonOpenCompute/ROCm/issues/703... [17:54:49] ebernhardson: :D [18:08:05] (03CR) 10Mforns: Add HelpPanel to EventLogging whitelist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501045 (https://phabricator.wikimedia.org/T220033) (owner: 10Nettrom) [18:10:37] * elukey off! [18:19:55] nuria, so then, should I make the edit_hourly edit_daily? 
[18:45:12] mforns: I think if we do not plan to query hourly edit_daily would work [18:45:50] mforns: but if you see a use case we are missing doing it this way please go with your initial idea [18:45:59] mforns: my concern was perf [18:46:12] mforns: but there might not be any perf issues [18:49:42] nuria, I think there will not be any perf issues for now, but if I see slowness in druid I will switch to daily, is that OK? [18:59:34] huh, stat1005 has more cores and more perf per core than 1007. Guess i might as well just run the CPU comparison there [19:02:52] mforns_brb: k [19:07:47] mforns_brb: sounds good [19:15:13] first order comparison: resnet50 does 4.1 images/sec training on cpu, 138 images/sec on the WX9100 [19:16:27] 10Analytics, 10Analytics-Kanban: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (10Nuria) Closed pull request and opened another one with hopefully the right changes: https://github.com/wikimedia/restbase/pull/1109 [19:34:20] (03PS4) 10Mforns: Add edit_hourly oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) [19:58:35] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (10Tbayer) Great - the capsule dimensions look good to me. It doesn't yet seem possible to switch to a daily time series, perhaps that is an artifact of... [20:00:05] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus [20:01:23] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting.
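For scale, the resnet50 throughput figures quoted above (4.1 images/sec on CPU vs 138 images/sec on the WX9100) work out to roughly a 34x training speedup; a quick back-of-envelope check with awk:

```shell
# Speedup implied by the benchmark numbers in the log above.
awk 'BEGIN { cpu = 4.1; gpu = 138; printf "%.1fx\n", gpu / cpu }'
# prints 33.7x
```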
https://wikitech.wikimedia.org/wiki/EventBus [20:08:58] 10Analytics, 10Analytics-Kanban, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (10CDanis) FWIW here's network traffic on kafka-jum... [20:13:10] (03CR) 10Nuria: "Let's please not merge this until we are sure we want edit_hourly and not edit_daily" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501328 (https://phabricator.wikimedia.org/T220092) (owner: 10Mforns) [20:13:30] nuria, sure [20:13:35] will mark it as WIP [20:13:51] mforns: superfast ninja response, ka-chaaa! [20:14:00] xD [20:14:35] (03CR) 10Mforns: [C: 04-2] "Still testing!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: 10Mforns) [20:15:29] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. 
https://wikitech.wikimedia.org/wiki/EventBus [20:15:37] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Test sqooping from the new dedicated labsdb host - https://phabricator.wikimedia.org/T215550 (10Nuria) [20:15:44] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Test sqooping from the new dedicated labsdb host - https://phabricator.wikimedia.org/T215550 (10Nuria) 05Open→03Resolved [20:16:40] 10Analytics, 10Analytics-Kanban, 10WMDE-Analytics-Engineering: Pyspark2 fails to read.csv when run with spark2-submit - https://phabricator.wikimedia.org/T217156 (10Nuria) 05Open→03Resolved [20:16:57] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Improve speed and reliability of Yarn's Resource Manager failover - https://phabricator.wikimedia.org/T218758 (10Nuria) 05Open→03Resolved [20:28:54] 10Analytics, 10EventBus, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Doing), and 3 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (10dduvall... [20:29:34] 10Analytics, 10EventBus, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Doing), and 3 others: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (10dduvall... [20:39:15] mforns: if i look at edits_hourly in turnilo there is still no data right? 
[20:39:25] nuria, no [20:39:59] I'm executing the edit_hourly job right now with the latest changes, to populate wmf.edit_hourly in Hive [20:40:20] and then I'll start testing the druid ingestion job [20:40:51] the wmf.edit_hourly job just finished [20:41:52] mforns: ok, was going to do some testing in superset with this datasource , will do once you are done [20:42:42] nuria, ok, if I get it working today I ping you, otherwise tomorrow [20:43:45] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus [20:44:48] (03PS5) 10Mforns: Add edit_hourly oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) [20:45:19] mforns: ya, no problem finally i can do the testing of superset and can use many other things [20:45:23] (03Abandoned) 10Mforns: Modify mediawiki/history/druid job to ingest a simpler data set to druid [analytics/refinery] - 10https://gerrit.wikimedia.org/r/499917 (https://phabricator.wikimedia.org/T211173) (owner: 10Mforns) [20:45:44] k [20:47:01] (03PS2) 10Mforns: Add edit_hourly to list of tables to be purged of old snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501328 (https://phabricator.wikimedia.org/T220092) [20:48:31] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10AndyRussG) Hi! Thanks so much!!! Here's what Hive said about the `event` field in the `event/centralnoticeimpression` table: ` event struct<... [20:53:25] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Nuria) Let's see, this data comes from eventlogging, in order for it to be useful we need to make sure FR-tech has switched to eventlogging being... 
[21:07:34] AndyRussG: nice work on the pageview titles, many thanks [21:18:30] (03CR) 10Hashar: "Now I am confused since I did the exact same change previously: https://gerrit.wikimedia.org/r/#/c/analytics/wmde/scripts/+/498831/1/src/r" [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501329 (owner: 10Hoo man) [21:19:04] (03CR) 10Hashar: [C: 03+1] "Oh sorry that is for the "production" branch. My bad." [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/501329 (owner: 10Hoo man) [21:20:51] !log Restarted turnilo to clear deleted datasource [21:21:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:43:11] (03PS8) 10Bmansurov: Oozie: add article recommender [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) [21:44:12] (03CR) 10Bmansurov: "The latest patch has succeeded [1]." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: 10Bmansurov) [22:09:33] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Set up edit_hourly data set in Hive - https://phabricator.wikimedia.org/T220092 (10Neil_P._Quinn_WMF) [22:09:43] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics, 10Patch-For-Review: "Edit" equivalent of pageviews daily available to use in Turnilo and Superset - https://phabricator.wikimedia.org/T211173 (10Neil_P._Quinn_WMF) [22:20:31] (03CR) 10Nuria: Add edit_hourly oozie job (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: 10Mforns) [22:24:01] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10AndyRussG) >>! In T217109#5086615, @Nuria wrote: > Let's see, this data comes from eventlogging, in order for it to be useful we need to make sur... 
[22:50:59] fun little puzzle for you spanish speakers: how do you say "disembodied" in spanish? [22:52:04] 10Analytics, 10Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (10Nuria) >The events have been left on at 0.01% sample rate (hope that's OK) Yes, of course. Once you are ready to switch pipelines let us know. [23:06:03] 10Analytics, 10Operations, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) Synthetic benchmarks of runtime performance of CNN train... [23:11:04] (03CR) 10Bmansurov: Oozie: add article recommender (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: 10Bmansurov) [23:11:32] 10Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (10Nuria) Per @AndyRussG's reply in: https://phabricator.wikimedia.org/T217109 the FR-tech team has not yet moved to the eventlogging pipeline for events, once that happens we can easily productionize a r... [23:11:45] 10Analytics: Move FR banner-impression jobs to events - https://phabricator.wikimedia.org/T215636 (10Nuria) [23:37:54] (03CR) 10Nettrom: ">" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/501045 (https://phabricator.wikimedia.org/T220033) (owner: 10Nettrom) [23:42:18] (03CR) 10Milimetric: "this is looking really good, but I haven't tested it yet and I'm still shocked that all the breakdown logic is so much simpler now. It mak" (036 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/498002 (owner: 10Fdans)