[01:39:06] (03PS1) 10MusikAnimal: Fix usability.wikimedia row to have 3 columns instead of 4 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546771 [01:56:15] 10Analytics, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Ottomata) Backfilled since July 31st: ` 19/10/29 01:34:26 INFO Refine: Successfully refined 1982 of 1982 dataset partitions into table `event`.`ChangesListHighligh... [01:59:43] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Ottomata) I've backfilled the last 90 days of data for ChangesListHighlights from MySQL. This schema doesn't have any whitelist defined, so that should be all we got! Now that... [02:00:16] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Ottomata) [02:02:01] 10Analytics, 10Product-Analytics: Start refining all blacklisted EventLogging streams - https://phabricator.wikimedia.org/T212355 (10Ottomata) @Neil_P._Quinn_WMF, now that T212367 is unblacklisted, can we close this (and that) task? [07:09:51] !log roll restart java daemons on analytics1042, druid1003 and aqs1004 to pick up new openjdk upgrades [07:09:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:10:12] yep we are almost ready to roll restart everything for openjdk upgrades [07:10:26] this time we have cumin cookbooks though! 
:P [07:27:59] very weird, it seems that aqs1004-a sees part of the cluster as down after the restart [07:28:08] nothing in the logs though [07:28:33] some graphs show increased read timeouts in other instances, maybe connected to that one [07:28:36] mmmm [07:29:07] ah no it just took ages to finish the startup [07:32:40] okok metrics come back to normal [07:47:45] 10Analytics, 10service-runner, 10User-Elukey: Upgrade service-runner on AQS to unblock rsyslog logging - https://phabricator.wikimedia.org/T236757 (10elukey) [07:48:13] 10Analytics, 10Analytics-Kanban, 10Operations, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) 05Open→03Stalled Pending T236757 [08:21:06] Morning [08:21:13] bonjour! [08:21:44] weird restart elukey - I assume it is cassandra [08:22:12] no idea, but didn't cause damaes [08:22:15] *damages [08:22:21] great [08:23:47] removing py2 deps for eventlogging now! [08:23:55] \o/! [08:24:33] we are only using py2 for Hue now [08:24:46] DEPRECATE ! DEPRECATE :)( [08:27:26] (03CR) 10Joal: [V: 03+2 C: 03+2] "Thanks @MusikAnimal!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546771 (owner: 10MusikAnimal) [08:33:46] rocm is getting packaged in Debian: https://lists.debian.org/debian-devel/2019/10/msg00275.html [08:33:55] https://salsa.debian.org/rocm-team [08:34:27] yessss [08:34:47] Nice! [08:34:51] moritzm: we'll likely try to keep up with upstream I imagine, to have more recent versions [08:35:02] fdans: shall we start per-file backfilling for a week? [08:37:31] joal sounds good, should I start a new coordinator or resume the current one and monitor it? 
[08:37:45] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (10elukey) [08:38:01] fdans: how you prefer - The other thing is to keep in mind which days have to be restarted because of failure [08:38:11] fdans: doing some maths [08:39:29] fdans: If our assumption is correct (cassandra can handle 7 days of load and have compaction happening after), then 7 days loading should take ~4h30 (data from previous loading) [08:40:05] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (10elukey) >>! In T234808#5605049, @Krinkle wrote: > Ack. @elukey Let me know what day/time works best next week. I'll be on US Eastern Time that week. In the EU afternoon a... [08:40:35] joal: I have a list of failed days, so I wouldn't worry about that [08:40:42] fdans: ack fdans [08:41:04] fdans: I think restarting with a new coord is probably safer - We start, then wait, and check [08:41:11] fdans: is that ok for you? [08:41:21] joal: it is, should I kill the big one? [08:41:26] fdans: If you prefer to resume and monitor, I'm happy with that [08:41:38] Yessir - Kill the big one, and we restart one by one [08:41:57] ok [08:43:24] fdans: I checked this: https://wikimedia.org/api/rest_v1/#/Mediarequests%20data/get_metrics_mediarequests_per_file__referer___agent___file_path___granularity___start___end_ [08:43:56] fdans: It's good, file_path has double-quotes escaped ;) [08:44:15] joal: but that's per file, so it's expected [08:44:22] I didn't do that in the first place for tops [08:44:29] per file has all urls url-encoded [08:44:30] fdans: had we tested it? [08:44:34] Ohhhh [08:44:35] yes [08:44:38] Didn't know [08:44:55] We store in cassandra URL-encoded, and then decode? 
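The encode/decode round trip being discussed can be sketched with Python's urllib.parse. The file path is the one from the log; the code itself is an illustration of the idea, not the actual AQS/refinery implementation:

```python
from urllib.parse import quote, unquote

# Illustration only: per-file mediarequest paths are stored and requested
# URL-encoded, then decoded before the Cassandra lookup. This is a sketch
# of that round trip, not the real AQS code.
file_path = '/wikipedia/commons/5/53/Google_"G"_Logo.svg'

encoded = quote(file_path, safe='')  # percent-encode everything, incl. '/' and '"'
decoded = unquote(encoded)           # round-trips back to the original

assert decoded == file_path
print(encoded)
```

The double quotes that needed escaping in JSON become plain `%22` once percent-encoded, which is why the per-file endpoint was never affected.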
[08:45:20] I should have been more thorough in CR [08:46:24] wait let me think [08:47:04] joal: I can't remember from the top of my head if urls in cassandra per file are already encoded or if we do it on the fly in aqs [08:47:22] but per file isn't the problem, tops is [08:47:53] I know per-file is not the problem now - But I preferred to double check before continuing loading [08:48:07] joal: totally makes sense :) [08:51:13] !log starting backfilling for per file mediarequests for 7 days from Sep 15 2015 [08:51:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:57:18] fdans: to be sure, I have triple checked in cassandra - The file_path is stored as: /wikipedia/commons/5/53/Google_"G"_Logo.svg [08:57:58] fdans: We decode URL-encoded file-path, request in cassandra, and AQS by itself escapes the value to send json back [09:01:04] joal: I just tried per file with that url, seems to work fine for august 2019 [09:01:19] it works fine indeed fdans - tested that [09:01:30] Just made sure I understood who does what :) [09:22:49] I am currently on myfi, my internet connectivity is no bueno today [09:23:36] k elukey [09:30:22] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts - https://phabricator.wikimedia.org/T234229 (10MoritzMuehlenhoff) >>! In T234229#5593857, @elukey wrote: > @MoritzMuehlenhoff @ArielG... [09:35:58] fdans: looks like your backfilling job has not started [09:36:46] joal: that's because I'm an uncaffeinated idiot [09:37:30] fdans: uncaffeinated is probably true, for the rest I wouldn't agree :) [09:41:02] joal: ok going https://hue.wikimedia.org/oozie/list_oozie_coordinator/0049549-190918123808661-oozie-oozi-C/ [09:43:23] Thanks fdans :) [09:48:59] joal: so in theory past Luca added some nice puppet automation for the TLS certs, that allows the deployment of truststores/etc.. 
and ssl xml config files before enabling the encrypted shuffle [09:49:25] (03PS1) 10Fdans: Escape double quotes in file urls [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546882 [09:49:26] <3 past Luca ! [09:49:35] then IIRC the procedure should be to enable mapreduce.shuffle.ssl.enabled and then roll restart the yarn daemons [09:50:52] I hope current Luca is in sync with past Luca :D [09:51:18] also, we need those certs [09:51:22] I forgot about this bit [09:51:23] # This is required to allow the datanode to start: [09:51:24] # https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/SecureMode.html#Secure_DataNode [09:51:27] dfs.http.policy: 'HTTPS_ONLY' [09:52:17] I am going to prepare and deploy the TLS config today, and then in theory on Thu we can enable it joal ? If you have time [09:52:37] Yessir elukey :) [09:52:39] that's great [09:52:43] it is the day before a holiday so probably not super good [09:52:46] what do you think? [09:53:08] elukey: If we go for it in the morning, I think we're fine [09:53:26] elukey: It either works first try, or we rollback [09:53:28] all right, and if we see issues we just rollback [09:53:32] ack ack [09:53:38] all right going to work on it :) [09:53:50] elukey: we can try it on test cluster today if you want [09:54:09] and by the way elukey, I'll be on regular schedule tomorrow (kids in holidays) [09:54:43] joal: it is already enable in the test cluster [09:54:50] ah! 
then even tomorrow is fine [09:54:59] *enabled [09:57:14] ack elukey :) [09:57:50] elukey: AQS mediarequest loading is back (1 week of data only) [09:57:59] elukey: in case you hadn't followed ;) [10:01:29] (03CR) 10Joal: [C: 03+1] "LGTM - Let's test and merge :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546882 (owner: 10Fdans) [10:05:17] super [10:05:27] if anything breaks we can blame fdans [10:05:28] :D [10:06:30] elukey: excuse me elukey how dare you [10:08:23] ahahaha [10:13:51] 10Analytics, 10Product-Analytics: Start refining all blacklisted EventLogging streams - https://phabricator.wikimedia.org/T212355 (10Neil_P._Quinn_WMF) 05Open→03Resolved >>! In T212355#5613779, @Ottomata wrote: > @Neil_P._Quinn_WMF, now that T212367 is unblacklisted, can we close this (and that) task? Eve... [10:21:22] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Neil_P._Quinn_WMF) 05Open→03Resolved a:03Neil_P._Quinn_WMF Looks good! Since Growth doesn't want the data anymore, I filed {T236770}, wh... [10:21:24] 10Analytics, 10Product-Analytics: Start refining all blacklisted EventLogging streams - https://phabricator.wikimedia.org/T212355 (10Neil_P._Quinn_WMF) [10:29:59] taking an earlier lunch break for some errands, ping me on the phone if needed :) [10:53:57] 10Analytics, 10Community-Tech, 10Product-Analytics (Kanban): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10aezell) @MaxSem Thanks for that. Turns out that it's not specifically just using EL but having whitel... [13:09:51] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Neil_P._Quinn_WMF) >>! In T159170#5613776, @Ottomata wrote: > I've backfilled the last 90 days of data for ChangesListHighlights from MySQL. 
This schema doesn't have any whitel... [13:19:11] elukey: would you be nearby? [13:23:32] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Ottomata) > Looks good! Since Growth doesn't want the data anymore Ha, oh ok. [13:24:06] joal: yep! [13:24:28] elukey: I have a question on using the file op in puppet [13:24:37] sure [13:24:59] elukey: Can I add a variable to the parameterization of the file op, or should they be defined in context? [13:25:26] elukey: I'm asking because I want to generate 2 files from the same template, meaning the same parameter should have 2 values [13:26:06] joal: and the files have different paths right? [13:26:10] And ultimately my question is: should I do file { 'template': param => value } or update param = value before the file operation [13:26:34] what is the parameter that changes? [13:26:37] elukey: same template, 2 different files [13:26:46] elukey: let's batcave if you want :) [13:26:55] yes better [13:31:38] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Neil_P._Quinn_WMF) >>! In T212367#5615061, @Ottomata wrote: >> Looks good! Since Growth doesn't want the data anymore > Ha, oh ok. At least t... [13:34:29] ottomata: Hello - Would you have a minute in da cave for a puppet question? [14:34:12] (03PS8) 10Awight: New reports for Reference Previews [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493) [14:35:02] joal: heya sorry! 
just saw ping [14:35:03] yayarrrr [14:35:25] hi ottomata - No big deal, I'm trying something, we'll see if it works :) [14:37:04] (03CR) 10Awight: "PS 8: discovered the `funnel` parameter to enable multi-line results" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493) (owner: 10Awight) [14:38:44] joal can chat in 5ish mins [14:42:10] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (10elukey) [14:44:17] joal: hiya [14:44:20] bc? [14:44:29] sure ottomata - elukey might wish to join :) [14:44:45] nono please go ahead :) [14:44:58] ottomata: I need to ask you one thing about cergen when you have a min [14:45:35] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (10elukey) a:05Krinkle→03elukey [14:46:10] 10Analytics, 10Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (10awight) This must be my fault, but I thought I would mention it anyway, at least until I understand what's wrong. I get this python2-backcompat-looking error when running reportupdater on... [14:50:42] elukey: sure whassssup [14:50:42] ? [14:50:55] 10Analytics, 10Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (10elukey) 05Resolved→03Open @awight this is probably something that we didn't test, as far as I know we don't use the graphite writer.. the stacktrace makes sense, a text string should... [14:50:57] 10Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (10elukey) [14:51:20] oh joal i do have a q! what's up with your event table spreadsheet? 
(it was shared with me so I saw it) [14:51:32] joining back :) [14:52:50] ottomata: I was reviewing the settings for TLS in hadoop test, and I noticed that at the time when I created the self signed cert for the CA the key password has been put as truststore password [14:53:08] that is what I can see in https://github.com/wikimedia/cergen/blob/f75d4bc22a16c5b320a13a4355e9c9233b6f96f2/cergen/certificate.py#L541 [14:53:38] in theory shouldn't the pass be different? The key should have one and the truststore (containing only the certificate) another one? [14:53:48] but I may be missing something here [14:57:49] 10Analytics, 10Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (10awight) +1 it does look like a bug. The solution would be as simple as `.encode('utf-8')` (I ran that locally and it works, at least) however I haven't been able to verify which encoding t... [14:59:12] 10Analytics, 10Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (10elukey) >>! In T204736#5615474, @awight wrote: > +1 it does look like a bug. The solution would be as simple as `.encode('utf-8')` (I ran that locally and it works, at least) however I hav... [15:01:31] elukey: probably! 
for the most part though, since the pws are stored in the yaml manifest file itself, it seemed irrelevant [15:01:50] would accept a patch that would change that [15:01:51] like [15:01:54] 10Analytics: Find a strategy to mitigate small-files handling for long-term kept events - https://phabricator.wikimedia.org/T236794 (10JAllemandou) [15:01:56] 10Analytics: Find a strategy to mitigate small-files handling for long-term kept events - https://phabricator.wikimedia.org/T236794 (10JAllemandou) [15:01:58] 10Analytics: Partition event-data daily instead of hourly (for sanitized data) - https://phabricator.wikimedia.org/T217350 (10JAllemandou) [15:02:08] like, if a truststore pw is provided in manifest [15:02:12] use it for truststore, [15:02:15] otherwise just use key pw [15:02:34] hey yall, I’m still not feeling super well, was up with Ada for a few hours last night, will try to work a bit but taking it easy [15:03:00] ottomata: yep yep no big deal, I was only trying to wrap my head around it :) [15:03:35] my main issue up to now is that the truststore pass will need to go in an xml file so hadoop will be able to read it [15:04:22] elukey: Thanks for all the help! This doesn't quite fit into our Phabricator threads, so I thought I would ask here: if graphite is being deprecated, what's the suggested way to export my reportupdater data into a public dashboard? 
awight: please feel free to ask anything in here :) [15:05:09] * awight cranks my spamming up to "7" ;-) [15:05:10] awight: in theory there is no clear deprecation timeline yet, so we can use it, but eventually we'll likely have to move it elsewhere [15:05:23] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (10Krinkle) 05Open→03Resolved [15:05:38] so we can keep it as it is now, but let's remember that at some point it will need to be moved [15:05:50] +1 no problem for my current project since it's a short-term product decision thing. But I'd still prefer to be learning a future-resistant platform [15:06:01] 10Analytics, 10Analytics-Kanban, 10Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (10Krinkle) I've restarted `coal`, `navtiming` and `stats` on webperf1001 and webperf2001 and verified in the journals that the restarts happened and that they resumed their d... [15:06:17] awight: ack, let's keep it for now and add the encode part to see how it works [15:07:36] I'm happy to take the easy way out for this, for sure. I just missed docs about what the alternative might be--are other teams just pulling .tsv files from `outputs` and processing privately? 
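For the "just pull the TSVs" route being asked about: reportupdater writes its reports as tab-separated files with a header row and a date in the first column. A minimal sketch of processing one of those files downstream, with an invented inline example standing in for a real file under `outputs` (the column names here are hypothetical):

```python
import csv
import io

# Hypothetical reportupdater-style output: header row, date first column,
# tab-separated metric columns. A real consumer would open the file from
# the report's outputs directory instead of this inline string.
tsv = "date\tbaseline\tpreviews\n2019-10-01\t120\t45\n2019-10-02\t130\t52\n"

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
total_previews = sum(int(r["previews"]) for r in rows)
print(total_previews)  # → 97
```

This keeps the processing private to the consumer, which is the trade-off versus publishing through graphite or a dashiki dashboard.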
[15:08:17] probably yes, Marcel is the best poc since I have no idea [15:09:39] awight: the other way would be dashiki for things like: https://analytics.wikimedia.org/dashboards/browsers/#all-sites-by-os [15:10:42] awight: reportupdater is mostly visualized with dashiki: https://language-reportcard.wmflabs.org/#projects=ptwiki,idwiki,eswiki,viwiki,ukwiki,dewiki,trwiki,ruwiki,frwiki/metrics=Content%20Translation [15:10:50] awight: for public dashboards [15:15:43] (03CR) 10Joal: "See the comment inline" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/543897 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:18:17] (03CR) 10Joal: [C: 03+1] "Looks good :) Thanks Andrew" (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/543897 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [15:22:38] nuria: Thanks for the pointer! I haven't made the graphs yet, so this is still an option. [15:31:20] (03PS1) 10Awight: Leftover py2-3 glitch: encode a string to bytes [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) [15:38:45] (03CR) 10Elukey: Leftover py2-3 glitch: encode a string to bytes (032 comments) [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [15:42:16] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [15:42:44] (03CR) 10Nuria: [C: 03+2] Fix usability.wikimedia row to have 3 columns instead of 4 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546771 (owner: 10MusikAnimal) [15:48:38] (03PS9) 10Awight: New reports for Reference Previews [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493) [15:50:13] (03CR) 10Awight: "I was able to verify that the baseline report runs and correctly publishes its data to Graphite. The concept is the same for the other tw" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493) (owner: 10Awight) [15:54:30] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10mforns) >> In T231858#5612967, @Ottomata wrote: >> Ahhh ok right. Right. >>> sanitize data in the log databases >> This should... [15:57:16] 10Analytics: Superset + Turnilo access for Verena Lindner + Raja Gumienny (WMDE) - https://phabricator.wikimedia.org/T231677 (10Verena) Yes, I can access Turnilo and Superset both. Thank you. [16:00:24] ping fdans, mforns, milimetric standup? [16:01:49] (03CR) 10Awight: [C: 04-1] Leftover py2-3 glitch: encode a string to bytes (032 comments) [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [16:01:57] oops [16:03:24] sorry still not feeling well [16:03:34] cc nuria ^ [16:03:43] milimetric: got it! 
[16:19:41] (03CR) 10Awight: [C: 04-1] Leftover py2-3 glitch: encode a string to bytes (031 comment) [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [16:38:07] (03PS2) 10Awight: graphite.py: encode a text string before socket.send [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) [16:43:58] (03CR) 10jerkins-bot: [V: 04-1] graphite.py: encode a text string before socket.send [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [16:45:46] (03CR) 10Elukey: "New tests seem failing :(" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/546971 (https://phabricator.wikimedia.org/T204736) (owner: 10Awight) [16:50:52] fdans: the escaping! [16:51:09] fdans: I think this is fine but so we know [16:51:09] nuria: oh right, back to bc? [16:51:19] fdans: the way we are doing it will insert data like Google_\"G\"_Logo.svg [16:51:30] yes [16:53:47] nuria, BTW, is it OK to delete the current data_quality_hourly table with all useragent entropy data? [16:54:15] it can be recalculated, but maybe someone wants it available until then? [16:54:38] mforns: can it be recalculated from navtiming data? [16:54:47] mforns: that we retain? 
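The graphite.py patch above ("encode a text string before socket.send") comes down to a Python 3 rule: sockets send bytes, not str. A minimal sketch of the fix, using a local socketpair in place of a real Graphite connection (the metric name is made up; the line format is Graphite's plaintext protocol):

```python
import socket

# A local socketpair stands in for a TCP connection to Graphite.
sender, receiver = socket.socketpair()

# Graphite plaintext protocol: "<metric.path> <value> <unix_timestamp>\n"
metric_line = "reportupdater.example_report.value 42 1572351600\n"

# Under Python 2, send() accepted a str; under Python 3 a bare str raises
# TypeError, so the text line must be encoded to bytes first.
sender.send(metric_line.encode("utf-8"))

received = receiver.recv(1024).decode("utf-8")
assert received == metric_line

sender.close()
receiver.close()
```

This is the same `.encode('utf-8')` idea awight proposed on the Phabricator task, just demonstrated end to end.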
fdans: ok [17:05:28] nuria, no, useragent is nullified in event_sanitized, we only have 3 months of that [17:05:36] good point [17:09:17] 10Analytics: Superset + Turnilo access for Verena Lindner + Raja Gumienny (WMDE) - https://phabricator.wikimedia.org/T231677 (10Nuria) 05Open→03Resolved [17:11:47] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Rerun sanitization before archiving eventlogging mysql data - https://phabricator.wikimedia.org/T236818 (10Nuria) [17:13:39] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10Nuria) I have created a task about re-running sanitization: https://phabricator.wikimedia.org/T236818 Sanitization is needed... [17:14:26] mforns: then let's move the data aside so we can insert it on the new tables but let's not lose the data we used for our study [17:14:39] nuria, ok [18:02:22] fdans: the jobs to load data daily for mediarequests are happening as well right? those we have not stopped, correct? [18:02:53] nuria: correct [18:12:22] fdans: i think you and joseph did a calculation of capacity for cassandra before starting loading mediarequest [18:12:36] fdans: did that get written to wikitech ? [18:12:48] fdans: i can see cassandra instances at 40% capacity [18:13:29] fdans: this makes me think we should start loading most recent [18:13:35] fdans: rather than 4 years ago [18:14:19] nuria: [18:14:29] since it's going to take so long i think that's the best idea [18:14:45] once these 2 weeks are complete I'll load from May 2019 backwards [18:15:36] fdans: ok, let's also document the capacity planning [18:16:02] fdans: and let's document loading in wikitech [18:21:16] * elukey off! [18:28:39] fdans: do you know about nodetool and how to look at compaction stats? 
[18:54:54] 10Analytics: logging level of cassandra should be warning or error but not debug - https://phabricator.wikimedia.org/T236698 (10Nuria) Setting of INFO logging level: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/cassandra/daily/workflow.xml#L254 But it might not be this cause definitely we h... [18:59:46] \o/ I'm back [19:00:32] 10Analytics: logging level of cassandra should be warning or error but not debug - https://phabricator.wikimedia.org/T236698 (10Nuria) Also, oozie is logging at info level, this is probably what we want: [nuriaruiz@nurieta][/workplace/operations/puppet]$ more ./modules/cdh/templates/oozie/oozie-log4j.properties.... [19:01:08] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10mforns) @Ottomata @Nuria Another solution is archiving only the sanitized data. I don't think having the last 90 days of data... [19:04:40] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10Nuria) You are right that it is not critical but deleting it selectively from all tables is probably more work than sanitizing... 
[19:18:05] fdans, nuria: About cassandra loading, I think we should look for a loading period allowing not to interfere with daily loading of pageviews [19:18:42] fdans, nuria: joint loading of big stuff is bound to make the system suffer [19:19:37] fdans, nuria: To me a 20-day period is probably a good spot, allowing to start after pageview-compaction-recovery and should leave time for mediarequests-compaction recovery before getting into the pageview stuff for new da [19:45:58] 10Analytics, 10Wikimedia-Stream, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, 10Services (watching): EventStreams process occasionally OOMs - https://phabricator.wikimedia.org/T210741 (10Ottomata) p:05Triage→03Low [20:03:19] (03CR) 10Ottomata: [C: 03+2] Add HDFSCleaner to aid in cleaning HDFS tmp directories [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/543897 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata) [20:23:54] Heya nuria - I'm assuming the error we got from oozie on cassandra-daily-coord-local_group_default_T_unique_devices was you testing the log-level, right? [20:24:14] joal: ah yes, just correcting that NOW [20:24:37] no prob nuria, just wanted to confirm :) [20:29:12] also nuria, do you have comments about my views on cassandra loading? [20:30:45] joal: views? [20:31:02] my comments above I mean :) [20:31:19] joal: ahhhh [20:34:03] joal: i think we need to test it, in the issues on friday and saturday I think what was happening was that cassandra had a major backlog as f.dans had started loading years of data at once [20:36:10] joal: loading 20 days might be the sweet spot but we need to try it out. I do not think loading pageviews and mediarequests at the same time is necessarily problematic, probably depends on the backlog of compaction that cassandra already has when that happens, makes sense? 
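Scaling the earlier estimate from the channel (7 days of data ≈ 4h30 of loading) linearly, a 20-day window comes out around 13 hours. This is only a back-of-the-envelope extrapolation; the log itself notes real timing depends on compaction backlog:

```python
# Rough linear extrapolation from the numbers quoted in the discussion;
# actual loading time varies with cassandra's compaction backlog.
hours_per_day_of_data = 4.5 / 7      # 7 days of data took ~4h30 to load
estimate_20_days = 20 * hours_per_day_of_data
print(round(estimate_20_days, 1))    # → 12.9
```

That fits the proposal: a ~13-hour load starting after the nightly pageview loading still leaves recovery time before the next 01:00 UTC cycle.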
[20:36:42] nuria: IMO multiple big loading is problematic [20:37:51] We already have 2 big loadings in parallel (pageview + mediarequest) happening around 01:00 UTC - Adding a third one is not a good thing IMO [20:39:10] joal: wouldn't it depend on how big the third one is? see: https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_lgc_pagecounts_per_project&var-table=data&var-quantile=99p&from=1569789544487&to=1572381544488 [20:39:36] joal: given alarms clearly doing twice as much is not an option [20:39:53] nuria: for sure it depends on how much we load - per-file, as in per-article, is relatively big [20:39:58] joal: actually more like 2.5 as much [20:44:14] nuria: When looking at charts between 01:00 UTC loading on 23rd and 24th of October I can't see differences, but on 23rd jobs failed, while on 24th they succeeded [20:45:26] Anyway - You have my points :) [20:46:19] 10Analytics, 10Analytics-Kanban: Check Avro as potential better file format for wikitext-history - https://phabricator.wikimedia.org/T236687 (10JAllemandou) I have generated avro files for 2019-09 dumps, and ran quite some queries on them with limited amount of RAM needed per executor for spark, and without is... [20:46:41] joal: yaya, i think we agree, we have to be cautious and look and see how loading is affecting the system [20:46:51] for anyone interested in weird stuff: https://phabricator.wikimedia.org/T236687#5617392 [20:46:57] joal: I think you did the capacity planning with francisco early on right? [20:47:09] joal: i can see we are at 40% in both cassandra instances [20:47:12] I think we did that with Dan [20:49:16] joal: for mediarequests? [20:49:28] nuria: yes [20:49:35] joal: on your ticket did you try SELECT count(1) FROM joal.mediawiki_wikitext_history_avro where snapshot = '2019-09' and wiki_db = 'frwiki' and compared counts? 
[20:49:54] yes I did - I should have mentioned [20:49:58] joal: is it documented anywhere in wikitech? (cc milimetric ) [20:50:08] also, numbers with rlike are the same [20:50:23] I can't recall if we documented the sizing we did [20:51:03] currently on aqs1004: pageview-per-article takes 913G, and mediarequest-per-file 178G [20:52:30] And my recollection from loading cassandra a long time ago for pageviews is that compaction is actually very powerful - Meaning early loading stages generate a lot more than final ones as keys are already present [20:54:23] ok - logging off for now - Will monitor cassandra loading tomorrow morning [20:54:31] joal: ok! [20:54:33] Thanks nuria for the brain bounce [20:55:27] nuria: if the capacity planning is on wikitech, I don’t know where, but I do remember we had at least a year before we had to worry [20:55:47] milimetric (cc joal) let's make sure to document that [20:58:27] 10Analytics, 10Fundraising-Backlog, 10Fundraising Sprint Usual Subscripts, 10Fundraising Sprint V 2019: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10DStrine) [21:11:17] What would we need to get pageviews on a Cloud VPS wiki? [21:23:39] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Create a reports directory under analytics.wikimedia.org - https://phabricator.wikimedia.org/T235494 (10Ottomata) > It turns out it is possible to remove a file from the public folder simply by deleting it from the source folder. How... [21:26:26] 10Analytics: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10Danielsberger) Hi @lexnasser , Let me first answer why we need a timestamp field. At a high level, the goal of most caching research projects is to come up with a new algorithm and th... [21:34:28] Pharos: what is a cloud vps wiki? 
[21:36:23] nuria https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction [21:37:08] Pharos: I think we know what cloud services is, but what do you mean by a "cloud vps wiki"? [21:39:02] Pharos: wikis are hosted elsewhere than cloud, are you thinking "pageviews for a wikitech wiki page" (or similar?) [21:40:36] nuria I mean this wiki specifically: https://wikispore.wmflabs.org/wiki/Main_Page [21:42:03] It's a little unusual because it is an experiment hosted on cloud vps [21:42:45] Pharos: i see, a wiki installed in a lab host. There are no pageviews for those, we provide pageviews for all production wikis hosted by WMF but not experiments or similar installed elsewhere [21:43:42] ok, thanks, I guess we'll try to install an extension to provide some of the functionality [21:44:06] we would like it to be a production wiki someday, but we are not there yet :) [21:46:48] Pharos: icubator? [21:47:18] Pharos: incubator? we have some upcoming ideas to count pageviews of incubator wikis but that's the closest I can think of. 
@nuria It's kind of like incubator, but for new genres of wikis, not new languages [21:48:10] like trying to start a wiki for oral history [21:48:11] Pharos: ya, i understand [21:48:37] Incubator has its own rules, so we started something small and new [21:49:23] Pharos: sounds good but the best i think you would be able to do in cloud vps is apache logs [21:49:32] Pharos: for now [21:49:43] Pharos: also you could run in cloud an instance of piwik [21:49:53] Pharos: and use it to give you data [21:50:13] ok, thanks very much for these suggestions [21:50:24] Pharos: piwik is now called amatomo: https://matomo.org/ [21:50:27] *matomo [21:50:35] Pharos: super easy to install, php + mysql [21:51:10] since it's an experiment, we will try a few different mini-projects, and it will be helpful to see if any of them get significant views [21:51:25] Pharos: and you add a snippet like [21:51:41] Pharos: and that beacon will report pageviews [21:52:15] Pharos: that i think would be easy and convenient [21:54:42] wonderful, thanks nuria! [21:55:00] Pharos: let me point you to puppet to install piwik/matomo [21:56:11] Pharos: https://github.com/wikimedia/puppet/blob/production/modules/matomo/manifests/init.pp [22:17:56] 10Analytics, 10Analytics-Kanban: logging level of cassandra should be warning or error but not debug - https://phabricator.wikimedia.org/T236698 (10Nuria) 
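The snippet nuria refers to (dropped from the log) would be Matomo's JavaScript tracker, but the beacon it sends is just an HTTP request to Matomo's matomo.php tracking endpoint, which can also be driven server-side. A sketch that only builds such a request URL, with placeholder host and site id (nothing is sent):

```python
from urllib.parse import urlencode

# Placeholders: a real setup would use the wiki's own Matomo host and the
# site id configured in Matomo's admin UI.
MATOMO_HOST = "https://matomo.example.org"
SITE_ID = 1

def beacon_url(page_url, title):
    """Build a Matomo HTTP tracking request for one pageview."""
    params = urlencode({
        "idsite": SITE_ID,     # which configured site this hit belongs to
        "rec": 1,              # required flag: actually record the hit
        "url": page_url,       # the page being viewed
        "action_name": title,  # page title shown in Matomo reports
    })
    return f"{MATOMO_HOST}/matomo.php?{params}"

print(beacon_url("https://wikispore.wmflabs.org/wiki/Main_Page", "Main Page"))
```

In practice the JS tracker snippet does all of this client-side; this form is mainly useful for wikis that want to log pageviews without injecting JavaScript.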