[00:03:42] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is OK: CRITICAL https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[00:04:09] (CR) Nuria: "Corrected couple bugs but mostly left new code as is." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/572726 (https://phabricator.wikimedia.org/T245453) (owner: Joal)
[00:04:46] Nettrom: indeed
[00:40:00] Nettrom: i cannot fix notebooks cause i do not have enough permissions
[00:40:08] Nettrom: to whack big dirs
[00:53:45] nuria: uff, that's a bummer
[00:54:09] groceryheist: sounds like you could delete some data from notebook1004
[00:57:59] nuria: thanks for looking into it, though
[01:32:38] nuria: I removed a few GB
[03:42:27] Analytics, Analytics-Kanban, Patch-For-Review: Unify puppet roles for stat and notebook hosts - https://phabricator.wikimedia.org/T243934 (Nuria)
[03:43:18] Analytics, Research: Add SWAP profile to stat1005 - https://phabricator.wikimedia.org/T245179 (Nuria)
[05:11:53] Analytics, Analytics-Wikistats, translatewiki.net, Patch-For-Review: Add stats.wikimedia.org to translatewiki.net - https://phabricator.wikimedia.org/T240621 (abi_) Related patch has been deployed on translatewiki.net and Wikistats 2.0 is now available for translation on twn. Pending items, 1....
[07:16:42] @milimetric, for the dumps, the coordinator.properties and .xml refer to mw_raw_directory and mw_private_directory. Look: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/bucketed/coordinator.xml . If I no longer reference the mw_project_namespace_map, is there any need for the other mw_ properties?
[08:33:49] (CR) Joal: Fix webrequest host normalization (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/572726 (https://phabricator.wikimedia.org/T245453) (owner: Joal)
[08:37:20] (PS4) Joal: Fix webrequest host normalization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/572726 (https://phabricator.wikimedia.org/T245453)
[09:33:09] joal: bonjour!
[09:33:14] Hi elukey :)
[09:33:18] time has come to roll-restart everything for JVM upgrades
[09:33:28] ok if I proceed with hadoop?
[09:33:43] it feels like last time was not long ago - let's go :)
[09:34:04] it is every 3/4 months now, for our immense joy
[09:36:52] joal: btw, did you see https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&fullscreen&panelId=25&from=now-30d&to=now ?
[09:37:23] nope :(
[09:38:10] elukey: how long do we keep the hdfs trash?
[09:38:13] 1 month is it?
[09:39:06] yep
[09:39:18] (roll restart started)
[09:39:31] elukey: I think event data dropped by Andrew has not yet been removed
[09:42:12] could make sense yes
[09:43:59] elukey: 200+Tb in analytics trash
[09:44:24] elukey: as of 2020-02-14
[09:44:35] elukey: do you wish me to force drop those?
[09:45:02] also elukey - 200+Tb real, not hdfs-used (no need to do *3)
[09:48:29] oh my, yes please go
[09:48:42] ack elukey
[09:49:42] elukey: I confirmed all this data comes from the events Andrew and Marcel deleted
[09:50:04] !log Force delete old api/cirrus events from HDFS trash to free some space
[09:50:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:50:23] (PS6) Fdans: Add language selection functionality to Wikistats [analytics/wikistats2] - https://gerrit.wikimedia.org/r/564047 (https://phabricator.wikimedia.org/T238752)
[09:50:30] (CR) Fdans: Add language selection functionality to Wikistats (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/564047 (https://phabricator.wikimedia.org/T238752) (owner: Fdans)
[09:50:48] elukey: last chance before deletion :)
[09:51:00] nono I am good
[09:51:05] ack !
[09:52:16] (CR) Fdans: "thank you for the patch @MarcoAurelio and the review @Jforrester" [analytics/wikistats2] (refs/meta/config) - https://gerrit.wikimedia.org/r/573970 (https://phabricator.wikimedia.org/T245805) (owner: MarcoAurelio)
[10:05:05] elukey: data dropped - No more gain possible from trash
[10:05:33] * elukey dances
[10:05:48] elukey: still 1.63 Pb
[10:07:25] 200Tb in user folders, 1.4Pb in /wmf/data
[10:07:27] well it is better than before
[10:07:55] in /wmf/data/ - 75Tb in archive, 400Tb in raw, 850Tb in wmf
[10:09:01] And in /wmf/data/wmf/ - 450Tb for webrequest and 275Tb for mediawiki
[10:10:40] elukey: will drop a version of wikitext (75Tb) - Keeping the last 2
[10:12:29] joal: I think it is fine to avoid dropping more, there is plenty of space, it was just to spot if there was a trend of data growing too fast
[10:13:20] elukey: I think the data increase was the new wikitext snapshot
[10:13:33] elukey: We should devise a different system for this
[10:14:52] Analytics: create kerberos identity for jmorgan - https://phabricator.wikimedia.org/T246118 (elukey) Open→Resolved a: elukey
[10:35:47] I just tested PyHive with recent commits and it works with presto+kerberos
[10:35:53] Where can I parse the dumps, and what's the new format?
[10:35:54] not even on notebooks
[10:36:00] *even
[10:39:40] Analytics, Analytics-Kanban, User-Elukey: Kerberize Superset to allow Presto queries - https://phabricator.wikimedia.org/T239903 (elukey) The following works even on Notebooks: ` # !pip install git+https://github.com/dropbox/PyHive.git@437eefa7bceda1fd27051e5146e66cb8e4bdfea1 # !pip install requests...
[10:58:21] Analytics, Analytics-Kanban, User-Elukey: Kerberize Superset to allow Presto queries - https://phabricator.wikimedia.org/T239903 (elukey) SAML support seems to have not got traction in FAB (used by superset): https://github.com/dpgaspar/Flask-AppBuilder/issues/1028
[11:22:12] wow people are already translating Wikistats like there's no tomorrow - https://translatewiki.net/wiki/Translating:Wikistats_2.0
[11:39:33] good :)
[11:39:35] * elukey lunch!
[12:32:14] Hello djellel
[12:33:00] djellel: you'll find some doc here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_history
[12:33:40] djellel: I suggest trying your parsing/data extraction on small wikis first as a test
[12:34:59] elukey: PyHive news is exciting :)
[12:35:11] I want to play with Presto in notebooks!
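For reference, a rough sketch of what a kerberized Presto connection through PyHive can look like from a notebook, assuming PyHive installed from the commit mentioned in the task above together with requests-kerberos; the coordinator host, port and catalog below are placeholders, not the real service endpoints:

```python
# Sketch only: host/port/catalog are placeholders, and a valid Kerberos ticket
# (kinit) is assumed to already exist in the session.
from pyhive import presto
from requests_kerberos import HTTPKerberosAuth, REQUIRED

conn = presto.connect(
    host='presto-coordinator.example.wmnet',   # placeholder host
    port=8281,                                  # placeholder TLS port
    protocol='https',
    catalog='hive',
    username='myuser',
    requests_kwargs={
        'auth': HTTPKerberosAuth(mutual_authentication=REQUIRED),
        'verify': '/etc/ssl/certs/ca-certificates.crt',
    },
)

cur = conn.cursor()
cur.execute('SELECT 1')
print(cur.fetchall())
```

From there the cursor behaves like any other DB-API cursor, so the same pattern works for real Hive tables once the placeholders are replaced.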
[12:37:07] Hallo
[12:37:19] Hi aharoni
[12:56:37] I had a question about hive, but figured it out myself :)
[12:56:57] ok aharoni :)
[13:07:05] o/
[13:08:14] joal: if I had a crazy idea, and said something like... I wanted to scrape every URL used on a page on a wikimedia project, and grab any semantic data (schema.org stuff), and create an API allowing access to that, what would your immediate thoughts be?
[13:09:27] Hi addshore - per-page means you're not after historical data, and you want to keep up-to-date fast enough I guess
[13:09:54] not after historical data no
[13:10:50] the idea being then, to be able to go from these facts defined in web pages semantically, that are linked to from wikipedia, and link them with statements that are in wikidata
[13:11:34] addshore: while I kinda have an idea of what you just wrote, I'm not gonna dive into that just now :)
[13:16:34] addshore: Parsing the wikitext can be done on hadoop on a monthly basis - More regular updates need some more infra we don't have yet
[13:16:56] then the serving bit will highly depend on how you want to access the data
[13:17:19] well, not talking about wikitext, rather the external links, and the content of those pages linked to :P
[13:17:57] addshore: external-links as in pagelinks table?
[13:19:14] well, as in the external links tables ;) https://www.mediawiki.org/wiki/Manual:Externallinks_table
[13:19:27] right :)
[13:20:58] addshore: So from external-links to a facts-oriented view of external pages - IMO it involves crawling :)
[13:21:07] yup :P
[13:21:26] im guessing analytics doesnt do anything at all like that yet? ;)
[13:21:44] nope
[13:22:06] external crawling is not in the toolbox
[13:22:24] is there such a toolbox? :P
[13:23:00] There are OSS crawlers out there, but I have not used them so can't give an opinion
[13:23:09] ack! thanks!
[13:23:15] np addshore
[13:24:56] Analytics, Operations, User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (ayounsi) Resolved→Open It's at 0% now and alerting.
[13:25:06] :(
[13:25:17] no noes :P
[13:25:40] elukey: would you by any chance be nearby?
[13:38:54] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (Gehel) Adding this amount of data to WDQS does not seem to be a good idea. We might want to redefine the higher level problem that we are trying to address here,...
[13:39:00] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (Gehel) Open→Declined
[13:50:01] joal: I am now
[13:50:48] elukey: notebook1003 is full :(
[13:50:57] elukey: How can I help
[13:51:29] it is also 1004
[13:53:17] I am looking forward to just deprecating the notebooks
[13:53:25] +1000
[13:53:26] and using the space on the stats
[13:53:43] I would argue that it is usually the same people that keep using more home space than allowed
[13:54:01] elukey: Please do so - You have gathered enough evidence
[13:54:17] checking now...
[13:55:34] ah no joal, it is 1004
[13:55:36] 1003 is ok
[13:56:55] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice: Refining is failing to refine centranoticeimpression events - https://phabricator.wikimedia.org/T244771 (Ottomata) Ah Nuria, sorry I !logged in #wikimedia-analytics IRC and expected it to get posted here (like -operations does). Th...
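To make the crawling idea above slightly more concrete, here is a toy sketch of pulling schema.org JSON-LD blocks out of one external page. This is not an existing tool in the toolbox; it assumes requests and BeautifulSoup are available, and a real crawler would also need rate limiting, robots.txt handling and storage:

```python
import json

import requests
from bs4 import BeautifulSoup

def extract_json_ld(url):
    """Return any schema.org JSON-LD blocks embedded in the page at `url`."""
    resp = requests.get(url, timeout=10,
                        headers={'User-Agent': 'example-crawler/0.1 (contact@example.org)'})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'html.parser')
    blocks = []
    for tag in soup.find_all('script', type='application/ld+json'):
        try:
            blocks.append(json.loads(tag.string or ''))
        except ValueError:
            continue  # skip malformed JSON-LD
    return blocks

# Example: print(extract_json_ld('https://example.org/some-article'))
```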
[14:00:49] ok so 1004 has the root partition full
[14:00:59] I just freed some space with apt-get clean
[14:02:03] Hello World
[14:02:25] Q: is /srv/published/datasets/ on stat1005 mapped onto https://analytics.wikimedia.org/published/datasets/?
[14:03:12] hello! yes it should be
[14:04:49] it takes a bit to sync though, it is not instant
[14:13:12] elukey: Thanks!
[14:16:50] Hi! I'm interested in the number of active editors per language Wikipedia. What's the easiest way to get that data? I see I can pull it up for each individual wiki via Wikistats, but I want to sum the total across all Wikipedias.
[14:19:48] joal: ouch
[14:19:50] elukey@notebook1004:/$ sudo du -hs tmp/
[14:19:51] 26G tmp/
[14:20:25] Samwalton9: maybe https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2 ?
[14:37:25] Useful, thanks!
[14:49:50] Analytics, WMDE-Analytics-Engineering: Upgrade to R version >= 3.6.0 on stat1005 - https://phabricator.wikimedia.org/T246224 (GoranSMilovanovic)
[14:50:36] I don't think this would take too much time but it would save me a lot, really: https://phabricator.wikimedia.org/T246224 Thanks
[14:52:49] GoranSM: I am a little bit confused - you have been working on 3.3 on stat1007 and now you'd need 3.6 on 1005 ?
[14:52:54] Analytics, WMDE-Analytics-Engineering: Upgrade to R version >= 3.6.0 on stat1005 - https://phabricator.wikimedia.org/T246224 (GoranSMilovanovic)
[14:53:21] Yes. However: on stat1007, with R 3.3.0, I have the older versions of the packages installed.
[14:54:04] But stat1007 is typically full and one of my processes was just killed half an hour ago or so. So I've decided to move to stat1005. But now the packages have evolved and I need R >= 3.6.0
[14:54:09] the main issue is that we rely on R from Debian, and currently stat1005 is on buster, so it has the most up-to-date version from Debian upstream
[14:54:33] o/
[14:54:36] I am not sure if we can package R and upgrade, it is a big effort
[14:54:56] GoranSM: can you try stat1004 instead to unblock yourself? Same env as stat1007 in theory
[14:55:41] ottomata: o/
[14:55:47] I am back on stat1007, stat1004 does not have enough resources (RAM, cores) for what I need to do. I am updating http://wmdeanalytics.wmflabs.org/WD_LanguagesLandscape/ which is huge - to say the least.
[14:57:09] sure
[14:57:11] elukey: What I really need is a containerized R for WMDE Analytics, but now is not the time to discuss that, obviously.
[14:57:27] elukey: to avoid this and similar problems in the future
[14:57:33] GoranSM: i hope in the next few months we can start using anaconda and conda envs
[14:57:41] for which you should be able to install your own R stuff into your conda env
[14:58:13] ottomata: that would be great. I typically do not use R with Anaconda (but I maintain Python envs there), but - whatever solution turns out to work, I am fine with it.
[15:00:55] Analytics, WMDE-Analytics-Engineering: Upgrade to R version >= 3.6.0 on stat1005 - https://phabricator.wikimedia.org/T246224 (elukey) Open→Declined Had a chat with Goran on IRC, he is going back to stat1007 for the moment.
[15:04:27] ottomata: if you have time today https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/574722/
[15:04:36] (to move RU from stat1006 to an-launcher)
[15:06:19] !log dropped and re-added backfilled partitions on event.CentralNoticeImpression table to propagate schema alter on main table - T244771
[15:06:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:06:23] T244771: Refining is failing to refine centranoticeimpression events - https://phabricator.wikimedia.org/T244771
[15:13:48] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice: Refining is failing to refine centranoticeimpression events - https://phabricator.wikimedia.org/T244771 (Ottomata) Ok fixed. The problem was that even though the table had the correctly ALTERed schema, each pre-existing partition st...
[15:17:31] Analytics: Refine should DROP IF EXISTS before ADD PARTITION - https://phabricator.wikimedia.org/T246235 (Ottomata)
[15:31:41] @Everyone: one of my R scripts is using a lot of resources on stat1007 right now (approx. 32Gb of RAM). Please do not kill it - it is updating the single most complex system that we maintain in WMDE: http://wmdeanalytics.wmflabs.org/WD_LanguagesLandscape/. The script is optimized to return all of the resources that it uses back to the OS as soon as it no longer needs them. Thank you for your understanding.
[15:33:20] It just got killed on stat1007.
[15:34:29] PROBLEM - Check the last execution of refinery-import-siteinfo-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:34:31] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:35:03] PROBLEM - Check the last execution of refinery-import-wikidata-all-json-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:35:39] PROBLEM - Check the last execution of wikimedia-discovery-golden on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:37:04] GoranSM: nobody killed it manually, it was the OOM killer of the OS
[15:37:59] PROBLEM - Check the last execution of refinery-import-page-history-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:38:03] PROBLEM - Check the last execution of refinery-import-wikidata-all-ttl-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:40:31] GoranSM: your script is - literally - breaking other software that runs there
[15:41:57] nuria: I am doing my best, currently introducing an additional optimization. I've tried to migrate to stat1005 but the R version there is behind what I need for some package dependencies.
[15:43:46] GoranSM: sounds like you need to do quite a bit more work in optimizing your scripts
[15:45:33] RECOVERY - Check the last execution of refinery-import-siteinfo-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-siteinfo-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:45:35] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:46:07] RECOVERY - Check the last execution of refinery-import-wikidata-all-json-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-wikidata-all-json-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:46:43] RECOVERY - Check the last execution of wikimedia-discovery-golden on stat1007 is OK: OK: Status of the systemd unit wikimedia-discovery-golden https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:47:44] nuria: The system needs to deal with contingency tables of high cardinality. Looking at the Apache Spark documentation, distributed computation over such tables is not possible in Spark, which is constrained to work with tables that do not have too many categories. This R script deals with, say, all Wikidata labels across all Wikidata languages. That is the reason why I have not moved the whole thing to the cluster, which I would do otherwise. I simply have no other way to compute what I need than to go for in-memory processing with R. Sorry.
[15:48:21] RECOVERY - Check the last execution of refinery-import-page-history-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:48:27] RECOVERY - Check the last execution of refinery-import-wikidata-all-ttl-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-wikidata-all-ttl-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:53:31] elukey: I understand that the OS killed the process. It is not personal :) I just have to update the Wikidata Language Analytics one way or the other.
[15:54:32] Analytics: Refine should DROP IF EXISTS before ADD PARTITION - https://phabricator.wikimedia.org/T246235 (JAllemandou) Nice catch!!
[15:55:39] (CR) Nuria: [C: +2] Fix webrequest host normalization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/572726 (https://phabricator.wikimedia.org/T245453) (owner: Joal)
[15:55:55] thanks nuria for the bug catching --^
[15:57:04] Analytics: Refine should DROP IF EXISTS before ADD PARTITION - https://phabricator.wikimedia.org/T246235 (Nuria) >They had to be manually dropped and re-added to get them to pick up the new and proper table schema. I am missing here some concepts cause i just did not know that partitions of a table could ha...
[15:58:25] the /srv/published/datasets/wmde-analytics-engineering/wdcm/Sitelinks directory on stat1005 is still not mapped to https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/ - the files were created at 13:38 on stat1005.
[16:04:53] mforns: holaaa I have updated the RU code review
[16:06:40] joal: Mediawiki wikitext 'history' or 'current'?
[16:07:05] GoranSM: what in particular is not synced? (so I can check what's wrong)
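For context on the contingency-table concern above: the hard limits are mostly in Spark's wide-format helpers (DataFrame.stat.crosstab caps distinct values per column at roughly 1e4, and groupBy().pivot() is bounded by spark.sql.pivotMaxValues). Keeping the table in long form, one row per cell, avoids those caps; the sketch below uses made-up table and column names purely for illustration:

```python
# Sketch only: 'some_db.wikidata_labels' and its columns are hypothetical.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('long-form-contingency').getOrCreate()

labels = spark.read.table('some_db.wikidata_labels')        # hypothetical table

# One row per (language, label_type) cell instead of a wide matrix:
long_counts = (labels
               .groupBy('language', 'label_type')           # hypothetical columns
               .agg(F.count(F.lit(1)).alias('n')))

long_counts.write.mode('overwrite').parquet('/tmp/contingency_long')  # placeholder output
```

The wide matrix, if it is needed at all, can then be pivoted locally from a much smaller aggregate.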
[16:07:21] djellel: I don't understand your question
[16:07:41] GoranSM: in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/Sitelinks/ I see last modified 13:38 today
[16:08:06] joal: I found two tables: mediawiki_wikitext_current and mediawiki_wikitext_history
[16:08:12] Indeed djellel
[16:08:20] there are two tables
[16:08:59] the link you referenced earlier describes mediawiki_wikitext_history
[16:10:08] djellel: correct - you can also find https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_current
[16:10:41] fyi updated hosting and details for turnilo administration, looking into Andy's issue now (https://wikitech.wikimedia.org/w/index.php?title=Analytics%2FSystems%2FTurnilo&type=revision&diff=1857231&oldid=1850004)
[16:11:54] joal: do you have handy the task that we used to ask for the labstore mountpoint on stat1006/1007 by any chance?
[16:11:58] otherwise I'll look for it
[16:12:02] (need to ask the same for launcher1001)
[16:12:11] nope elukey :(
[16:12:29] I'm sure it exists, I'll try to help find it
[16:12:34] np :)
[16:13:16] AndyRussG: was normalized_request_count ever working through Turnilo? I don't see it in the config at all, neither now nor removed in the history
[16:13:25] (I didn't look super closely)
[16:13:41] milimetric: hi! yep, it definitely used to work :)
[16:14:40] needed to get accurate counts for all the non-Fundraising campaigns, since those use client-side sampling for calls to /beacon/impression
[16:14:51] thx much!
[16:14:52] huh... it's been like this for 2 years since we moved from pivot: https://github.com/wikimedia/puppet/blame/ef99835a63e71d5a1ebf5fa8c8a191b1c75fc7d4/modules/turnilo/templates/config.yaml.erb#L789
[16:14:57] joal: do you have a simple example that uses mwparserfromhell on mediawiki_wikitext_current by any chance?
[16:15:03] nope
[16:15:17] djellel: we have not used that parser on the cluster yet
[16:15:37] anyway, mforns/joal did you by any chance remove normalized_request_count from the banner_impressions_minutely turnilo config? I don't see that it was ever there, but checking just in case
[16:15:40] milimetric: oh Pivot vs Turnilo right... hmm I don't think it's been that long since it stopped working, but I could be wrong
[16:15:52] can't recall milimetric
[16:16:18] AndyRussG: oh, ok, if it might have been that long then that makes sense because with Pivot we were auto-detecting settings and when we switched we needed more control
[16:16:19] joal: milimetric: I guess it's possible it worked in Pivot but not Turnilo
[16:16:30] k, adding in any case
[16:16:35] thanks so much!
[16:22:27] aaargh turnilo, a screw (tornillo)... I just can't un-see it!
[16:24:07] Analytics, Analytics-Cluster, Cloud-Services, Operations: notebook1003 failed network mount on boot - https://phabricator.wikimedia.org/T204857 (elukey) Open→Resolved a: elukey
[16:24:14] djellel: I remind you to use small wikis for tests when dealing with text please - your currently running test query needs 30k mappers ...
[16:24:22] djellel: please :)
[16:24:59] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review: Turnilo no longer showing sample-rate adjusted data for banner activity - https://phabricator.wikimedia.org/T241162 (Milimetric)
[16:25:14] Analytics, Analytics-Kanban, Fundraising-Backlog, Patch-For-Review: Turnilo no longer showing sample-rate adjusted data for banner activity - https://phabricator.wikimedia.org/T241162 (Milimetric) a: Milimetric
[16:25:40] joal: damn, no salvation with LIMIT
[16:25:48] nope
[16:26:22] djellel: you have a "where" clause other than partition, this means computation on the data
[16:26:40] djellel: if you remove that where clause, job will be small
[16:28:38] djellel: maybe python-mwtext is an option ? https://github.com/mediawiki-utilities/python-mwtext/blob/master/mwtext/wikitext_preprocessor.py
[16:37:00] milimetric: I applied your suggestions in the language selector patch, it's ready to merge :)
[16:37:13] we already have turkish!
[16:37:33] sweet, will look... probably quite a bit later due to meetings and other duties
[16:37:42] turkish! So so cool
[16:52:23] milimetric, I don't recall removing normalized_counts
[16:52:35] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (Yurik) @Gehel let's define `this amount of data`, just for clarity. My back-of-the-envelope calculations: * each pageview statistics statement is a counter (8 b...
[16:53:35] must be what you said
[16:58:57] dcausse: Hello! would you have a minute for me?
[16:59:55] joal: in a meeting
[17:01:01] ping ottomata
[17:02:21] ping mforns hola
[17:03:34] dcausse: let's see if we manage to find each other before end-of-day, if not, tomorrow :)
[17:15:15] (CR) Ottomata: [V: +2 C: +2] "We'll need to deploy these manually after merge." [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/574710 (https://phabricator.wikimedia.org/T245897) (owner: Joal)
[17:17:22] (CR) Milimetric: [C: -1] Add language selection functionality to Wikistats (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/564047 (https://phabricator.wikimedia.org/T238752) (owner: Fdans)
[17:20:17] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Give clear recommendations for Spark settings - https://phabricator.wikimedia.org/T245897 (Ottomata) Merged and deployed updated spark kernels on notebook1003 and notebook1004.
[17:23:22] Analytics-EventLogging, Analytics-Kanban, Event-Platform, CPT Initiatives (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (Ottomata)
[17:25:44] GoranSM: we need you at some point in #wikimedia-research . We have a (joint) office hour now and there is a WD related question. I've already told the person about you, I just need you there to officially connect. :D
[17:30:22] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (Nuria) I think before talking about bytes you need a use case, what is the use case here? As we mentioned earlier the GLAM folks care about human pageviews (real...
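Since there is no existing mwparserfromhell example on the cluster, here is an illustrative sketch only (not a refinery job) of parsing wikitext from mediawiki_wikitext_current with a Python UDF, keeping both partition predicates so the job stays small, per the advice above. The column names (wiki_db, page_title, revision_text) and the snapshot value are assumptions; check the table schema before running:

```python
import mwparserfromhell
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName('wikitext-parse-sketch').getOrCreate()

def template_names(text):
    """Names of templates used in one revision's wikitext."""
    if text is None:
        return []
    code = mwparserfromhell.parse(text)
    return [str(t.name).strip() for t in code.filter_templates()]

template_names_udf = udf(template_names, ArrayType(StringType()))

current = (spark.read.table('wmf.mediawiki_wikitext_current')
           # Both predicates are on partition columns, which is what keeps the
           # mapper count small; the values are examples only.
           .where("snapshot = '2020-01' AND wiki_db = 'simplewiki'"))

(current
 .select('page_title', template_names_udf('revision_text').alias('templates'))
 .limit(10)
 .show(truncate=False))
```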
[17:35:31] Analytics, Analytics-Kanban, Better Use Of Data, Desktop Improvements, and 6 others: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 (Nuria) a: Ottomata
[17:53:33] so I have tested spark on a host in test different from an-tool1006
[17:53:38] all works
[17:53:43] * elukey cries in a corner
[17:58:30] joal: sorry, have to go, will ping you tomorrow morning
[17:58:38] np dcausse see you :)
[17:59:00] I have reinstalled spark2 on an-tool1006 and now it works
[17:59:12] ......................
[17:59:18] * joal sends sparkylove to elukey :S
[18:00:50] milimetric: for the dumps, the coordinator.properties and .xml refer to mw_raw_directory and mw_private_directory. Look at lines 24 and 26: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/bucketed/coordinator.xml . If I no longer reference the mw_project_namespace_map, is there any need for those mw_raw_directory and mw_private_directory properties?
[18:06:18] lexnasser: I believe the mw_private_directory is where it finds the monthly geoeditor numbers and you still need those. But the raw directory is no longer needed after you get rid of mw_project_namespace_map
[18:07:14] lexnasser: that's why it's organized like this in the prop file: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/mediawiki/geoeditors/bucketed/coordinator.properties#L50
[18:09:53] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Give clear recommendations for Spark settings - https://phabricator.wikimedia.org/T245897 (nshahquinn-wmf) Thank y'all so much! This is very helpful, and the consistency really helps reduce the mental overhead for us. @JAllemandou...
[18:15:58] milimetric: I'm probably misunderstanding something, but what do I need the mw_private_directory for? I thought the monthly geoeditor numbers (I assume you mean counts) are from the geoeditors_monthly_table parameter?
[18:22:56] * elukey off!
[18:26:00] * ottomata running errand back in a bit
[18:28:01] lexnasser: you're right :) So, the table is just an abstraction in Hive, when you query that's what you need. But for oozie to know when a new snapshot is ready, it has to know which dataset to look at for the _SUCCESS flag. That's why you define the dataset directory and use the dataset as an input event
[18:30:00] milimetric: Got it, thanks for the response! Also one more quick question: I'm thinking of naming the new hive table that the API (Cassandra) will load from and the dumps will query from `wmf.geoeditors_monthly_public`. Do you think this name is fine or do you think something else would be better?
[18:31:14] lexnasser: sounds good to me, makes sense
[18:31:25] milimetric: Great, thanks!
[18:32:30] lexnasser: and after learning about datasets you'll probably soon have to deal with writing your own dataset for the new table, so the druid oozie job can know when to run
[19:03:29] (back)
[19:11:54] elukey: whaaaa
[19:11:55] ???
[19:12:04] weird!
[19:12:46] elukey: What is not synced - or at least it appears not synced to me - is that under https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/ I cannot see the Sitelinks directory.
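On the Spark settings thread (T245897), purely as an illustration of where such settings go when building a session from a notebook; the numbers below are placeholders, not the values actually recommended in the task:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('example-job')
         .master('yarn')
         # Placeholder values; use the figures recommended in T245897.
         .config('spark.executor.memory', '8g')
         .config('spark.executor.cores', '4')
         .config('spark.dynamicAllocation.maxExecutors', '64')
         .config('spark.sql.shuffle.partitions', '512')
         .getOrCreate())
```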
[19:14:01] elukey: Precisely: going directly to https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm/Sitelinks/, as you have done, really shows the updated files, but going to https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/wdcm does not show the Sitelinks directory. What could be the cause?
[19:35:23] Analytics: refinery-drop-older-than failing tests - https://phabricator.wikimedia.org/T246272 (Ottomata)
[19:50:58] elukey: Everything seems to be in sync w. published datasets from stat1005 now.
[20:01:22] Analytics: Should reportupdater Pingback reports be refactored? - https://phabricator.wikimedia.org/T246154 (mforns) One thing we could do, as suggested by Dan, is to purge event_sanitized.mediawikipingback by deleting all events that are not the state of the art of a given wiki (remove all but last pingback...
[20:20:16] Analytics: refinery-drop-older-than failing tests - https://phabricator.wikimedia.org/T246272 (mforns) I think this might be caused by the following patch +2'd by me (my bad) https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/573215/3/bin/refinery-drop-older-than The error that is popping up in the test i...
[20:24:57] Analytics: Should reportupdater Pingback reports be refactored? - https://phabricator.wikimedia.org/T246154 (CCicalese_WMF) @mforns that sounds like a great idea.
[20:27:18] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (Yair_rand) Most query result sets meant for human consumption would benefit from having the results sorted by pageviews. Needing to filter for a certain level o...
[20:29:11] gone for tonight
[20:35:05] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Add pageviews total counts to WDQS - https://phabricator.wikimedia.org/T174981 (christophbraun) @Nuria WDQS is currently used by the GLAM community to create queries that are beyond the scope of existing tools for a specific purpose as menti...
[21:12:22] Analytics, Inuka-Team, Product-Analytics: Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (nshahquinn-wmf)
[21:12:24] Analytics, Inuka-Team, Product-Analytics: Set up pageview counting for KaiOS app - https://phabricator.wikimedia.org/T244547 (nshahquinn-wmf)
[21:16:40] Analytics, Product-Analytics (Kanban): Spark applications crash when running large queries - https://phabricator.wikimedia.org/T245896 (nshahquinn-wmf)
[21:18:57] Analytics, Product-Analytics (Kanban): Spark applications crash when running large queries - https://phabricator.wikimedia.org/T245896 (nshahquinn-wmf) >>! In T245896#5917131, @Nuria wrote: > Per our conversation, this can be somewhat alleviated with better settings but hive is a better alternative for l...
[21:23:09] mforns: had an IRC chat with timo and learned a bunch of stuff
[21:23:25] and i have some ideas on how to isolate the stream config
[21:23:28] yt? can we bc?
[21:44:51] ottomata, yes
[21:44:57] you still?
[21:45:07] ah
[21:45:08] yes 1 min
[21:45:13] me too 1 min
[21:45:54] in bc i actually have to run v soon! let's chat tho
[21:48:09] mforns: ^
[21:49:41] omw
[22:03:39] ottomata: ping me if you and mforns are done talking
[22:03:57] we are in da cave
[22:04:02] nuria, ^
[22:33:35] hey a-team: looks like notebook1004 ran out of disk space again, and I think I'm the one to blame for it
[22:34:11] I figured out that R tries to dump to a temp file, which I think fills up the disk
[22:34:23] but I'm currently not able to find out where the tempfile is/was and try to delete it
[22:38:23] Nettrom: your notebook server gets a private temp file
[22:38:31] i think if you restart your notebook server it will be deleted
[22:39:06] ottomata: ah, thanks, let me try that
[22:40:34] that did the trick, magically 35G of disk space available
[23:01:59] Analytics, Inuka-Team: Update EventLogging to accept events from KaiOS app - https://phabricator.wikimedia.org/T246295 (nshahquinn-wmf)
[23:02:53] Analytics, Inuka-Team: Update EventLogging to accept events from KaiOS app - https://phabricator.wikimedia.org/T246295 (nshahquinn-wmf) @AMuigai, @hueitan, this is work for Analytics, not for y'all.
[23:04:42] Analytics, Inuka-Team: Update EventLogging to accept events from KaiOS app - https://phabricator.wikimedia.org/T246295 (nshahquinn-wmf) @hueitan what will the app's user agent look like?
[23:08:53] Nettrom: so you know we will have a couple of notebook machines with much more disk space before too long
[23:09:03] Nettrom: still, ahem, they will not have infinite disk space