[07:38:43] hello team!
[07:38:58] I am going to attempt another el-py3 deployment
[07:50:53] everything looks good!
[07:50:55] \o/
[08:21:34] Yay elukey! py3 for the win :)
[08:21:54] \o/
[08:22:11] joal: bonjour!
[08:22:34] Bonjour Luca, all good?
[08:22:47] yep! and you ?
[08:23:56] Yes!
[08:28:28] every time that I am at one apachecon I get excited with Flink, this time I'll study it a bit more :)
[08:28:59] :)
[08:29:17] elukey: please let me know how you move with that, I'd love to be able to follow you!
[08:29:40] first thing is to study it to be able to understand you when you talk about it
[08:29:46] that would be massive for me :P
[08:29:54] then webrequest on flink!
[08:29:56] ahhahaha
[08:30:35] * joal will definitely need some studying
[08:30:40] :)
[08:30:57] ah joal we forgot to mention Ozone I think in the email
[08:31:26] I don't think so, I think I provided a link (but no description as an interesting project - we could do so)
[08:31:55] ok didn't see it then
[08:38:34] elukey: Ah, actually I mentioned the name, but no link
[08:45:56] back sorry
[08:46:09] we can mention it during standup! Andrew will surely be interested
[08:46:34] Yes
[08:52:12] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (10elukey) Deployment done, this time I don't see any error reported in logs or metrics!
[09:05:06] joal: I am checking cassandra-coord-pageview-per-article-daily, there are a ton of logs stored :D
[09:05:22] do we need to have the cassandra driver logging at debug level?
[09:05:48] elukey: yes, all cassandra loading jobs generate a ton of logs, we should tweak that indeed
[09:08:16] I can see only a timeout in there, but nothing super clear that explains why the load_cassandra step failed
[09:08:31] this is why I am asking, maybe having INFO/ERROR would be clearer
[09:08:34] also hue fails loading logs :D
[09:09:19] I can imagine hue can't load that amount of log :)
[09:14:34] !log manual re-run of cassandra-coord-pageview-per-article-daily - 26/10/2019 - as attempt to see if the error is reproducible or not (timeout while inserting into cassandra)
[09:14:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:17:11] very interesting joal: https://mvnrepository.com/artifact/ch.cern.hadoop
[09:17:33] Nice elukey :)
[09:17:44] but might be stale, it says 2018 and 2.7
[09:19:09] hm
[09:19:21] in any case, I'd feel more comfortable in depending from an Apache distro
[09:19:33] elukey: Thanks for that ;)
[09:19:38] elukey: so would I
[09:23:25] apachecon definitely gave us a good feeling about where we should go :)
[09:26:08] elukey: yes ! And also good info on Airflow
[09:28:38] 10Analytics, 10Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T236655 (10Asmat78)
[09:44:34] 10Analytics, 10Analytics-Kanban, 10Operations, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) @mobrovac I just noticed in the description the `newest version of service-runner` part. We currently run 2.6.7, is it enough?
[09:45:58] mmm also there are two cassandra-daily-coord-local_group_default_T_mediarequest_per_file running
[09:46:42] fdans: --^
[09:46:46] hola :)
[09:47:04] one is full of workflows not started
[09:47:09] the other one contains some failures
[09:48:52] elukey: my bad, I just opened irccloud
[09:49:20] there are a lot of workflows because there's a lot of days to backfill from 2015 to 2019
[09:50:12] I'm keeping track of the failed ones, will be restarting them this week as the backfilling advances
[09:50:39] sure sure, no issue, just wanted to know if we needed to have two coordinators
[09:51:14] (with user analytics I mena)
[09:51:18] *mean
[10:05:06] hm - This feels like cassandra might be overwhelmed
[10:05:51] joal: yeah I think the same
[10:06:08] btw welcome back joal and elukey :) I missed yall
[10:06:20] Hi fdans :) It's good to be back!
[10:07:15] https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=aqs&var-keyspace=local_group_default_T_lgc_pagecounts_per_project&var-table=data&var-quantile=99p&from=now-7d&to=now
[10:07:31] The number of pending compactions is what bothers me here
[10:07:54] 10Analytics, 10Discovery, 10Event-Platform, 10Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (10Gehel)
[10:08:46] fdans: Would you please stop backfilling for now, letting cassandra come back to less pending compactions?
[10:09:18] +1
[10:09:31] joal: of course, would it be enough to suspend the coordinator, or kill it?
[10:09:43] it is probably what is affecting all the cassandra loads as well (like pageviews)
[10:09:45] suspending is enough fdans,
[10:09:50] understood
[10:09:52] elukey: I think it is yes
[10:10:42] !log mediarequest per file backfilling suspended
[10:10:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:12:08] cassandra-coord-pageview-per-article-daily succeeded btw, I'll reply to alerts
[10:12:25] Thanks for that elukey
[10:12:36] ok let's monitor cassandra getting back on track
[10:12:44] and then restart gently backfilling
[10:13:33] fdans: I assume going for smaller amounts being backfilled will be needed (probably loading a week, wait for compaction, then load again etc - And we need to tweak the amount we load at once)
[10:15:58] joal: I see
[10:16:12] that sounds good
[10:16:24] fdans: compaction is what costs a lot
[10:17:03] joal: would overwhelming the cluster slow loading down?
[10:17:30] fdans: I think it would make jobs fail more than slowing down
[10:17:38] right
[10:17:52] It either works relatively fast, or fails
[10:18:04] * fdans cries because the loading job takes so long
[10:18:22] fdans: Can you remind me how we are on backfilling (which jobs, how much left)?
[10:18:47] joal: we are backfilling from 2015 per file
[10:18:51] so it's a lot of data
[10:18:53] daily
[10:18:55] It is indeed
[10:19:04] This is why the cluster is overwhelmed
[10:19:09] we're at Sep 2015, 5 days later
[10:19:28] I didn't realize that - I thought we were going for top (smaller amount)
[10:19:54] OH I need to talk to you about that joal probably in batcave
[10:20:09] maybe this evening
[10:20:18] fdans: as you want - can be now if you want
[10:21:44] joal: I need to go out for an errand now, let's catch up later if that's ok
[10:21:51] ack fdans :)
[10:34:08] Wow elukey - Just saw T236327 - Will this happen for our hadoop nodes?
[10:34:09] T236327: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327
[10:35:09] joal: we'll get 10G with the next round of hw refresh, the newer ones already have 10G
[10:35:17] awesome :)
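For reference, a rough Python sketch of the throttled backfilling idea joal describes above: load one chunk, then wait for Cassandra's pending compactions to drop back down before resuming. This is illustrative only; the Prometheus endpoint, the metric name and the use of `oozie job -suspend/-resume` on the coordinator are assumptions for the example, not the team's actual tooling.

```python
import subprocess
import time

import requests

PROMETHEUS = "http://prometheus.example.org/api/v1/query"   # placeholder URL
PENDING_METRIC = "cassandra_compaction_pendingtasks"        # assumed metric name
MAX_PENDING = 50                                            # arbitrary threshold


def pending_compactions() -> float:
    """Return the max pending compactions across the AQS Cassandra nodes."""
    resp = requests.get(PROMETHEUS, params={"query": f"max({PENDING_METRIC})"})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def wait_for_compactions() -> None:
    """Block until the cluster has caught up on compactions."""
    while pending_compactions() > MAX_PENDING:
        time.sleep(300)  # check again every 5 minutes


def throttled_backfill(coord_id: str, chunks: int) -> None:
    """Resume the coordinator one chunk at a time, pausing in between."""
    for _ in range(chunks):
        subprocess.run(["oozie", "job", "-resume", coord_id], check=True)
        time.sleep(3600)  # let roughly one chunk's worth of workflows run
        subprocess.run(["oozie", "job", "-suspend", coord_id], check=True)
        wait_for_compactions()
```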
[10:36:11] 10Analytics, 10CPT Initiatives (Revision Storage Schema Improvements), 10Epic, 10Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (10JAllemandou) @Nuria : I don't think we use the tables mentioned here. Labs might use them in the background,...
[10:39:00] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) We can definitely reimage if it is the best path suggested by SRE, but if possible I'd do it manually (so commenting temporarily the partman recipe) to av...
[10:41:23] 10Analytics, 10Analytics-Kanban, 10Operations, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10mobrovac) >>! In T219928#5610521, @elukey wrote: > @mobrovac I just noticed in the description the `newest version of service-runner` pa...
[11:06:41] 10Analytics, 10Analytics-Kanban, 10Operations, 10Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10elukey) Fine for me, will have to coordinate with my team on upgrading service-runner first!
[11:06:50] for --^ we'd need to upgrade service-runner on aqs
[11:07:00] should be a relatively easy deploy
[11:32:27] * elukey lunch!
[11:41:22] 10Analytics, 10Analytics-Kanban, 10Operations, 10Traffic, 10observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (10JAllemandou) From looking at the dashboards, it looks like the entire set of values we want to collect is what is currently display...
[13:37:54] 10Analytics: Check Avro as potential better file format for wikitext-history - https://phabricator.wikimedia.org/T236687 (10JAllemandou)
[13:38:02] 10Analytics: Check Avro as potential better file format for wikitext-history - https://phabricator.wikimedia.org/T236687 (10JAllemandou) a:03JAllemandou
[13:38:22] 10Analytics, 10Analytics-Kanban: Check Avro as potential better file format for wikitext-history - https://phabricator.wikimedia.org/T236687 (10JAllemandou)
[13:47:24] I have a question about reportupdater queries... Is there an efficient way to split grafana metrics out by wiki, with a path {_metric}.{wiki} for a single query; or is it reasonable/expected to run the query repeatedly for every wiki of interest; or is splitting metrics by wiki not a thing?
[13:47:33] Cos I don't see any examples in the source repo, so far.
[13:50:10] awight: mforns is the best poc for this!
[13:51:38] elukey: Ah perfect, I'll just ask on the CR then :-)
[13:52:19] awight: IIRC you'll need a script in the middle to do the pivoting - But I'm not sure of how data is to be sent to graphite after
[13:52:32] elukey: question for you on puppet naming/organization
[13:52:58] elukey: We have profile::analytics::refinery::job::import_wikitext_dumps
[13:55:01] elukey: And I'd like to add a new job - But, the new job is not importing a wikitext dump, even if the dump is stored at the same place (different format), and I think we could make the current job more parameterized to better fit potential use-cases
[13:55:31] elukey: Would renaming the job: import_mediawiki_dumps_to_hdfs something conceivable?
[13:56:38] awight: an example of script pivoting data in RU: https://github.com/wikimedia/analytics-reportupdater-queries/blob/master/browser/dynamic_pivot.py
[13:56:40] joal: Great, that helps unblock me on writing the query itself.
[13:57:29] oh :-) hehe I was going to pivot in Hive, probably drawing some CR attention to the hydroelectric plant I'm helping to boil away a river somewhere.
[13:58:00] awight: I got it! the pivoting in python is applied at the end of the hive query-script itself!!
[13:58:06] awight: see https://github.com/wikimedia/analytics-reportupdater-queries/blob/master/browser/desktop_site_by_os_family_percent
[13:58:38] awight: pivoting in hive can be done without boiling too much of the ocean (usually) :)
[14:04:08] joal: sure! Even import_mediawiki_dumps would suffice in my opinion
[14:04:34] ok elukey :) Will do a CR :)
[14:04:38] Thanks
[14:05:02] super :)
[14:05:07] This is weird, there is documentation in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater#The_graphite_section explaining that the metric string can be split on column key and value from the output--however that would require more than one row of output! Here's an example query that will return > 1 row,
[14:05:13] https://gerrit.wikimedia.org/r/#/c/analytics/discovery-stats/+/322007/6/reports/events.sql
[14:06:42] awight: https://gerrit.wikimedia.org/r/#/c/analytics/discovery-stats/+/322007/6/reports/config.yaml
[14:07:10] awight: graphite key is built using multiple values :)
[14:07:54] That conflicts with the wiki page which says, "Regardless of which you choose, you must write code that returns a single data point"
[14:08:04] But the events query returns multiple rows...
[14:08:39] Don't get me wrong--I want to use this functionality if it exists! Just can't tell if I can rely on it.
[14:09:00] awight: given that a metric uses action and feature to define its key, there only is one datapoint per key, no?
[14:09:13] I'm misunderstanding I think :)
[14:11:31] Oh! That's perfect, then. I'll tweak the wiki page to possibly clarify... I think that means no pivot is necessary, too.
[14:13:07] awight: I give you my understanding, but I'd rather have mforns confirm ;)
[14:13:50] +1 cool, I'll ping mforns on the page
[14:15:28] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10Ottomata) Agree with Luca! As long as Kafka's data is maintained, re-imaging should be the same as a downtime for Kafka.
[14:20:56] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Upgrade to Superset 0.35 - https://phabricator.wikimedia.org/T236690 (10elukey)
[14:21:45] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 2 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10fgiunchedi)
[14:22:27] 10Analytics, 10Better Use Of Data, 10Product-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, and 3 others: Client side error logging production launch - https://phabricator.wikimedia.org/T226986 (10fgiunchedi)
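To make joal's point in the reportupdater discussion above concrete, here is a small illustrative Python sketch (not reportupdater's actual code) of how a multi-row query result still yields one datapoint per graphite key, because the key is built from the row's key columns. The column names, metric prefix and values are made up for the example.

```python
from typing import Iterable, List, Tuple


def to_graphite(rows: Iterable[dict], prefix: str = "daily.someReport") -> List[Tuple[str, float, str]]:
    """Map rows like {'date', 'action', 'feature', 'value'} to graphite datapoints.

    Each distinct (action, feature) pair produces its own metric key, so every
    row is still exactly one datapoint for one key.
    """
    datapoints = []
    for row in rows:
        key = f"{prefix}.{row['action']}.{row['feature']}"
        datapoints.append((key, float(row["value"]), row["date"]))
    return datapoints


# Example: two rows become two distinct keys, one datapoint each.
print(to_graphite([
    {"date": "2019-10-28", "action": "click", "feature": "baseline", "value": 42},
    {"date": "2019-10-28", "action": "hover", "feature": "preview", "value": 7},
]))
```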
[14:24:44] elukey: Quick question again - currently the import of dumps is defined in modules/role/manifests/statistics/private.pp
[14:25:28] elukey: I'm assuming that means that if we have 2 stat machines instead of 1, we'll have 2 imports (instead of 1)
[14:25:31] Is that correct?
[14:28:06] joal: so you mean if the same role gets applied to multiple stat boxes?
[14:28:26] elukey: I assume no currently, but it could in theory, right?
[14:29:31] joal: yep yep but it would break other things, we usually don't do it
[14:29:47] ok elukey - I was asking out of curiosity
[14:29:51] Thanks :)
[14:30:06] np!
[14:30:18] the stat boxes puppetization is a bit of a maze
[14:30:51] we'll hopefully have a unified role one day, maybe that includes jupyter everywhere?
[14:31:23] ooooohhh
[14:36:24] * elukey looks around and sees a big three headed dog and a big red and white tent staring at him
[14:38:53] Wow - interesting - Spark and Hive give different results when filtering an avro table
[14:39:19] * joal warms up behind elukey and prepares to help
[14:42:27] (03Abandoned) 10Awight: Report for ReferencePreviews popups [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/545869 (https://phabricator.wikimedia.org/T214493) (owner: 10Awight)
[14:44:28] ok - we're getting back to normal state in cassandra
[14:44:32] can you confirm that elukey ?
[14:46:53] yep seems so!
[14:53:05] Let's ask fdans to restart backfilling after meetings?
[14:55:39] yep
[15:02:39] Hi mgerlach - I have a question on new-user diff between webrequest and mediawiki-history
[15:03:01] mgerlach: Have you filtered for self-registered users only?
[15:18:01] 10Analytics: logging level of cassandra should be warning or error but not debug - https://phabricator.wikimedia.org/T236698 (10Nuria)
[15:20:10] (03PS6) 10Awight: [WIP] New reports for Reference Previews [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493)
[15:30:46] (03PS7) 10Awight: New reports for Reference Previews [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/542419 (https://phabricator.wikimedia.org/T214493)
[15:32:42] joal: https://indico.cern.ch/event/716795/contributions/2946453/attachments/1645011/2628931/BigData_at_CERN_JP.pdf - slide 9, very interesting
[15:33:04] the biggest cluster is not that far from what we manage
[15:41:21] 10Analytics: logging level of cassandra should be warning or error but not debug - https://phabricator.wikimedia.org/T236698 (10fdans) a:03Nuria
[15:41:54] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Upgrade to Superset 0.35 - https://phabricator.wikimedia.org/T236690 (10elukey)
[15:46:19] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Upgrade to Superset 0.35 - https://phabricator.wikimedia.org/T236690 (10fdans) p:05Triage→03High
[15:46:26] 10Analytics, 10Analytics-Kanban: Check Avro as potential better file format for wikitext-history - https://phabricator.wikimedia.org/T236687 (10fdans) p:05Triage→03High
[15:57:26] joal: yes, I filtered for creates_by_self
[15:57:43] ok great mgerlach - I was not sure :)
[16:46:01] 10Analytics, 10Product-Analytics, 10VisualEditor: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10kzimmerman) @mforns it looks like the subtasks we had for this have all been resolved; can this be closed?
[16:46:28] ottomata: if you have a min https://gerrit.wikimedia.org/r/#/c/operations/dns/+/546498/
[16:50:41] 10Analytics, 10Analytics-Kanban, 10Services (watching): Add cassandra loading job for top mediarequests - https://phabricator.wikimedia.org/T233717 (10Nuria) 05Open→03Resolved
[16:50:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10Services (watching): Create mediarequests top files AQS endpoint - https://phabricator.wikimedia.org/T233716 (10Nuria)
[16:51:17] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Move reportupdater reports that pull data from eventlogging mysql to pull data from hadoop - https://phabricator.wikimedia.org/T223414 (10Nuria) 05Open→03Resolved
[16:51:19] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria)
[16:56:58] hi a-team, I'm meeting people from the GII now. I just want to confirm that the geoeditors_monthly table is healthy and complete for 2019, is it? :)
[16:57:53] dsaez: I think dan knows mostly but he's off today, maybe nuria can confirm?
[17:00:02] ottomata, thanks, I'll ping dan later... I did a quick check and looks good to me...
[17:00:51] dsaez, I also just checked and it looks good to me
[17:01:14] data is present from 2019-01 to 2019-09 and volume of data seems legit
[17:02:22] dsaez, data format and contents seem OK as well
[17:03:21] dsaez (cc mforns ) table is not geoeditors
[17:03:29] as gii does not care about editors but edits
[17:03:39] oh
[17:03:40] dsaez, mforns table is geoeditors_edits_monthly
[17:03:45] I see
[17:04:00] checking
[17:04:00] dsaez, mforns : that was the source of the misunderstanding earlier this year
[17:05:31] nuria, ok, I was checking geoeditors_monthly, and looks good
[17:06:49] dsaez, geoeditors_edits_monthly looks equally good!
[17:07:14] perfect!
[17:07:18] thx!
[17:19:58] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Neil_P._Quinn_WMF) The [ChangesListHighlights](https://meta.wikimedia.org/wiki/Schema:ChangesListHighlights) schema is still active (1,221 events in the past week), but is still...
[17:22:13] Gone for tonight team, will see you tomorrow :)
[17:51:35] (03CR) 10Mforns: [C: 04-2] "After our discussions on data quality alarms, we decided to not use TSV reports nor Dashiki dashboards. We will rather use Presto and Supe" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns)
[17:52:49] (03CR) 10Mforns: [C: 04-2] "After our discussions on data quality alarms, we decided to not use TSV reports nor Dashiki dashboards. We will rather use Presto and Supe" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/546212 (https://phabricator.wikimedia.org/T235486) (owner: 10Mforns)
[18:14:26] (03PS3) 10Ottomata: Add HDFSCleaner to aid in cleaning HDFS tmp directories [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/543897 (https://phabricator.wikimedia.org/T235200)
[18:14:39] (03CR) 10Ottomata: "Ok, this is ready for review!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/543897 (https://phabricator.wikimedia.org/T235200) (owner: 10Ottomata)
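A quick sanity check along the lines of what nuria and mforns did above for the GII request could look like the snippet below, run from a PySpark session on a stat box. The table and column names (wmf.geoeditors_edits_monthly, month) are assumptions here, so verify them before relying on the query.

```python
from pyspark.sql import SparkSession

# Count rows per month for 2019 to confirm coverage and rough volume.
spark = (SparkSession.builder
         .appName("geoeditors-sanity-check")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    SELECT month, COUNT(*) AS row_count
    FROM wmf.geoeditors_edits_monthly
    WHERE month LIKE '2019-%'
    GROUP BY month
    ORDER BY month
""").show()
```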
[18:29:35] 10Analytics, 10Better Use Of Data, 10Event-Platform, 10Product-Infrastructure-Team-Backlog, 10Epic: Event Platform Client Libraries - https://phabricator.wikimedia.org/T228175 (10kzimmerman)
[18:29:53] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) Reporting a chat with Rob on IRC. We could do the following as test: 1) start with kafka-jumbo1001, schedule downtime and stop kafka. Also systemctl mask...
[18:30:16] 10Analytics, 10Operations, 10ops-eqiad, 10User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (10elukey) When we are ready we can coordinate to add the new NIC to kafka-jumbo1001 :)
[18:31:01] nuria, should the improved data quality table have a dt field instead of year, month, day and hour fields?
[18:31:31] I think year, month, day and hour are not strictly needed now that they will not be partitions any more
[18:32:02] elukey: you still around?
[18:32:05] want to talk about eventgate-logging?
[18:32:50] ottomata: I am yes, but I'd log off in 10/15 mins max.. is it good enough? Otherwise we can do tomorrow
[18:32:58] nuria, and even, we could have just one data_quality table for all granularities, instead of one table per granularity like we have now: data_quality_hourly would become data_quality
[18:33:27] elukey: either way? maybe if we talk now and you have time tomorrow before I get up you can work on it?
[18:34:29] ottomata: sure, let me join the bat cave
[18:34:36] k!
[18:38:30] ottomata, when you're free, can I sync up with you about the solution we found in grosking to the data quality alarms stuff, and also check some naming with you?
[19:00:49] 10Analytics, 10Product-Analytics, 10VisualEditor: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10mforns) > @mforns it looks like the subtasks we had for this have all been resolved; can this be closed? Yes! And thank you *a lot*...
[19:06:10] nuria: ping on https://phabricator.wikimedia.org/T233891#5554625 luca wanted confirmation that those NavigationTiming tables are ok to drop
[19:08:14] ottomata: looking
[19:09:37] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10Ottomata) Checking in on this. > If there are tables that we can drop because related to old/not-wanted EL schemas, let's do...
[19:09:52] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hive data quality alarms pipeline - https://phabricator.wikimedia.org/T235486 (10mforns)
[19:10:15] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Hive data quality alarms pipeline - https://phabricator.wikimedia.org/T235486 (10mforns)
[19:14:33] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10Nuria) In this case we decided to drop the data for the subtasks linked, the rest will remain in host that from now on will a...
[19:21:07] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Performance-Team (Radar): Drop Navigationtiming data entirely from mysql storage? - https://phabricator.wikimedia.org/T233891 (10Nuria) Spot checked these and while they have data for 2018/2019 I can not see any for 2017 so per @Gilles comment ab...
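Purely as a hypothetical sketch of the single data_quality table idea mforns floats above (one table for all granularities, with a dt timestamp column instead of year/month/day/hour), the DDL could look something like the snippet below. The database name and every column other than dt and granularity are invented for illustration, not the actual design.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical unified table: granularity becomes a column, and dt replaces the
# year/month/day/hour fields that are no longer needed as partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.data_quality (
        dt            STRING COMMENT 'ISO-8601 timestamp of the measured period',
        granularity   STRING COMMENT 'hourly / daily / monthly',
        source_table  STRING COMMENT 'table the metric was computed from',
        query_name    STRING COMMENT 'name of the quality query',
        metric        STRING COMMENT 'metric name',
        value         DOUBLE COMMENT 'metric value'
    )
    STORED AS PARQUET
""")
```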
[19:21:08] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10Ottomata) Ahhh ok right. Right. > sanitize data in the log databases This shouldn't be necessary, right? IIUC, data shoul...
[19:23:48] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (10elukey) >>! In T231858#5612967, @Ottomata wrote: > Ahhh ok right. Right. > >> sanitize data in the log databases > This sh...
[19:23:53] * elukey off!
[19:25:23] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria) In this case is more than renaming, schema seems too deeply nested to be persisted in hive (json object inside json object). Given that there does not seem to be much int...
[19:46:52] ottomata, did you see my last ping? I don't want to start refactoring before you, too, give the OK :P
[19:47:09] i didn't!
[19:47:18] mforns: sorry!
[19:47:23] no problemo
[19:47:25] want to bc or IRC?
[19:47:28] if bc gimme 6 mins
[19:47:43] what you prefer, but maybe bc will be faster, sure I wait!
[19:49:59] k ya gimme just a few mins
[19:51:15] actually mforns my thing is taking a while to run
[19:51:18] to the BC!
[19:52:20] ah nm it finished...
[19:52:20] haha
[19:53:46] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Ottomata) Hm actually, looking at the current ChangesListHighlights schema, I don't see any reason why it couldn't be refined. It has a clearly defined schema (sub objects are...
[19:56:06] 10Analytics, 10Product-Analytics: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Ottomata) Ah ha! I was able to fix this schema. Its `filters` field had its array items defined as an array itself, rather than an object, so the refine job wasn't able to figure out...
[19:56:49] mforns: ok in bc
[19:58:23] ottomata, joining
[20:18:41] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (10Nuria) @Ottomata nice! let me update the other task
[20:22:48] 10Analytics, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Nuria) I see data and while i think the filters field looks strange as is I guess the schema defines it this way. 2019-10-27T18:23:14Z {"action":"set","filters":[{"na...
[20:31:27] 10Analytics, 10Product-Analytics, 10Patch-For-Review: Start refining ChangesListHighlights events - https://phabricator.wikimedia.org/T212367 (10Nuria) per @Catrope looks like this schema is not used and can be retired, if so devs need to do changes to stop sending events.
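To illustrate the ChangesListHighlights fix Ottomata describes in T212367 above, here are two simplified JSON Schema fragments written as Python dicts (the real schema text is not reproduced here, and the item property names are assumed). The broken version declares the `filters` array items as an array, which leaves Refine with no named fields to derive a Hive type from; the fixed version declares them as an object.

```python
# Broken: items of `filters` declared as an array, so there are no named
# fields for the refine job to map to a Hive struct.
filters_broken = {
    "type": "array",
    "items": {"type": "array"},
}

# Fixed: items are an object with named properties, which maps cleanly to
# an array<struct<...>> column when refined.
filters_fixed = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},   # property names assumed for the example
            "value": {"type": "string"},
        },
    },
}
```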
[20:48:35] 10Analytics: Request for a large request data set for caching research and tuning - https://phabricator.wikimedia.org/T225538 (10lexnasser) Hi @Danielsberger, I’m working on compiling this new public dataset for your caching research. I had a few questions that I hope you could answer so that I could get a bette...
[20:49:31] nuria: added comment on https://phabricator.wikimedia.org/T225538 as request for more info from daniel. I will most likely start test querying with jupyter later today
[20:50:43] lexnasser: nice
[20:56:23] 10Analytics: Remove postal code and longitude / latitude from geocoded data object - https://phabricator.wikimedia.org/T236740 (10Nuria)
[20:57:46] (03CR) 10Mforns: Update to include 1.33 and 1.34 (031 comment) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/545917 (https://phabricator.wikimedia.org/T223414) (owner: 10Cicalese)
[21:37:01] 10Analytics, 10Product-Analytics, 10VisualEditor, 10User-Ryasmeen: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10Neil_P._Quinn_WMF) 05Open→03Resolved a:03Neil_P._Quinn_WMF {meme, src="tech-barnstar"}
[21:37:27] 10Analytics, 10Product-Analytics: Hash all pageTokens or temporary identifiers from the EL Sanitization white-list - https://phabricator.wikimedia.org/T220410 (10Neil_P._Quinn_WMF) a:05Neil_P._Quinn_WMF→03None
[22:58:29] 10Analytics, 10Community-Tech, 10Product-Analytics (Kanban): Hash all pageTokens or temporary identifiers from the EL Sanitization white-list for Community Tech - https://phabricator.wikimedia.org/T226861 (10MaxSem) @aezell We would need to go through a list of our team's projects and review whether each of...
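As a closing note on T236740 above, a minimal sketch of what dropping postal code and latitude/longitude from the geocoded data map could amount to. The key names are assumptions based on the task title, not the actual field names used by the geocoding code.

```python
def strip_precise_location(geocoded: dict) -> dict:
    """Return a copy of the geocoded map without fine-grained location keys."""
    dropped = {"postal_code", "latitude", "longitude"}  # assumed key names
    return {k: v for k, v in geocoded.items() if k not in dropped}


# Example run with made-up values.
print(strip_precise_location({
    "country": "IT",
    "city": "Rome",
    "postal_code": "00100",
    "latitude": 41.9,
    "longitude": 12.5,
}))
```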