[00:53:20] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:38:03] Analytics: Decomission notebook hosts - https://phabricator.wikimedia.org/T249752 (AndyRussG) Hi! I backed up everything I had on these hosts and shut down notebooks. Thanks so much!!!
[04:40:17] Analytics, MediaWiki-API: mostviewed generator not returning any results - https://phabricator.wikimedia.org/T254211 (Nuria) The pageview api works fine, https://wikimedia.org/api/rest_v1/metrics/pageviews/top/fr.wikipedia.org/all-access/2020/05/all-days so issue is likely in the glue code with the php a...
[05:34:36] Analytics, Analytics-Kanban: Spike, see how easy/hard is to scoop all tables from Eventlogging log database - https://phabricator.wikimedia.org/T250709 (Marostegui) Nice!!!!!!! \o/
[05:55:07] Analytics: Investigate why netflow hive_to_druid job is so slow - https://phabricator.wikimedia.org/T254383 (elukey) Summary of what I have gathered with Joseph and Marcel (please correct me if I am wrong): * the data shows a big jump around march since it is exactly 90d ago, so the limit for sanitization (...
[05:56:42] good morning
[05:56:48] going to reimage druid1005 in a few
[06:03:39] Analytics, Analytics-Kanban, Patch-For-Review: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` druid1005.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimag...
[06:17:39] elukey: quick note - the netflow hourly timer seems disabled - Is it expected?
[06:18:44] elukey: Ah - could it be that it is running from somewhere else?
[06:19:12] Cause I also see that camus-webrequest and camus-mediawiki_analytics_events are disabled
[06:19:15] weird :(
[06:19:28] I need to leave for kids, will check later
[06:19:30] bonjour
[06:19:36] what do you mean disabled?
[06:19:51] elukey: no date for NEXT nor LEFT
[06:20:18] elukey: for eventlogging_to_druid_netflow_hourly.timer, last run was 2 hours ago
[06:20:21] Fri 2020-06-05 06:20:00 UTC 3s left Fri 2020-06-05 06:10:01 UTC 9min ago camus-webrequest.timer
[06:21:10] elukey: it still shows me no NEXT for webrequest - but it tells me it has run 25s ago - probably just a GUI issue
[06:21:10] will check the hourly one
[06:21:30] joal: yeah I think it is probably something in list-timers
[06:21:46] ack elukey - thanks for that - I confirm runs have been ok until today ~4am
[06:21:49] later!
[06:22:12] nice :)
[06:23:42] RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:26:24] (Abandoned) Elukey: Upgrade to upstream version 1.24.0 [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/602367 (https://phabricator.wikimedia.org/T253294) (owner: Elukey)
[07:30:51] (PS1) Elukey: scap: update targets [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/602599
[07:31:19] (CR) Elukey: [V: +2 C: +2] scap: update targets [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/602599 (owner: Elukey)
[07:46:25] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1005.eqiad.wmnet'] ` and were **ALL** successful.
[07:49:26] (PS1) Elukey: Upgrade to upstream version 1.24.0 [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/602604 (https://phabricator.wikimedia.org/T253294)
[07:53:11] new turnilo version ready for testing on an-tool1005
[07:56:41] Analytics, Patch-For-Review: Upgrade turnilo to latest upstream - https://phabricator.wikimedia.org/T253294 (elukey) Added Turnilo to an-tool1005 (the superset staging instance) and deployed 1.24 to it. Sent an email to the team to test, after the green light I'll deploy to prod.
[07:57:01] Analytics, Analytics-Kanban, Patch-For-Review: Upgrade turnilo to latest upstream - https://phabricator.wikimedia.org/T253294 (elukey) a:elukey
[07:57:49] Analytics, Analytics-Kanban, Operations: Increase memory available for an-launcher1001 - https://phabricator.wikimedia.org/T254125 (elukey) >>! In T254125#6193413, @Milimetric wrote: > Can we re-enable reportupdater on the machine now? Already done a couple of days ago :)
[09:11:15] (PS52) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[09:12:35] (PS53) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[09:26:18] !log roll restart cassandra on AQS to pick up openjdk upgrades
[09:26:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:30:49] brb
[09:57:16] (PS54) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[10:08:13] (PS55) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[10:10:28] (PS56) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[10:12:26] (PS57) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[10:15:13] Analytics, Operations, serviceops, vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_C matomo1002.eqiad.wmnet --vcpus 4 --memory 8 --disk 50 START - Cookbook sre.ganeti.makevm Ready to c...
[10:15:24] Analytics, Analytics-Kanban: Move Matomo to Debian Buster - https://phabricator.wikimedia.org/T252740 (elukey)
[10:15:28] Analytics, Operations, serviceops, vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (elukey) Stalled→Open
[10:26:25] Analytics, Analytics-Kanban, Patch-For-Review: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (elukey)
[10:37:24] roll restart of aqs completed, all good afaics
[10:37:32] going afk for lunch!
[10:59:08] (PS58) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[11:23:46] Now that's interesting - The reason for the druid netflow-hourly ingestion task to be stuck waiting for lock is because of late data: last data imported by Camus for hour 22 of 2020-06-04 (yesterday, hour of the stuck ingestion task) is today 10:38 UTC
[11:25:16] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (Volans) >>! In T253980#6191340, @elukey wrote: > Interesting: the last bit of reimage failed for: > > ` > 05:33:42 | cumin1001.eqiad.wmnet | Puppet run completed > 05:33:42 | druid1003.eqiad.wmnet |...
[11:25:33] Waiting for next run to see if new data still comes in
[11:29:16] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (MarcoAurelio)
[11:29:23] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (Volans) >>! In T253980#6195883, @Volans wrote: > I can't find from the logs a reasonable explanation right now of why it happened. And as soon as I pressed submit I actually noticed it... the problem...
[11:40:33] Analytics, Analytics-General-or-Unknown, AbuseFilter: Provide regular cross-wiki reports on abuse filters actions - https://phabricator.wikimedia.org/T44359 (DannyS712)
[12:38:53] (CR) Joal: "Code is good, discussing the approach." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602463 (owner: Ottomata)
[12:49:12] (CR) Joal: "Minor comments - Will be extremely useful :)" (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[12:56:09] (PS59) Fdans: Add Pageviews Complete dumps backfilling job [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857)
[12:59:04] Analytics, Operations, netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (JAllemandou)
[12:59:08] elukey: so that you know --^
[13:00:43] Analytics, Analytics-Kanban, Patch-For-Review: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` druid1006.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimag...
[13:12:30] joal: nice! great work
[13:13:01] (CR) Fdans: "Applied comments, job tested successfully" (13 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857) (owner: Fdans)
[13:25:53] Analytics, Operations, netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (elukey) @ayounsi is it ok to set the "Time" field to `stamp_updated` rather than `stamp_inserted` ? As Joseph pointed out we have some weird situations lik...
[13:26:15] !log reimage druid1006 to debian buster
[13:26:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:35:43] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['druid1006.eqiad.wmnet'] ` and were **ALL** successful.
[13:35:57] all druid nodes on debian buster!!!
[13:35:59] * elukey dances
[13:36:03] \o/ !
[13:36:09] * joal dances with elukey :)
[13:36:31] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (elukey) All nodes migrated!
[13:36:41] Analytics, Analytics-Kanban: Upgrade Druid to Debian Buster - https://phabricator.wikimedia.org/T253980 (elukey)
[13:47:14] Analytics, Analytics-EventLogging, Beta-Cluster-Infrastructure: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (Ottomata) @mpopov
[13:48:38] (CR) Joal: "Bunch of comments. Another broader one: Couldn't we, for the migration, change kafka topics? Like produce events using new system to new t" (10 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865 (owner: Ottomata)
[14:10:56] Analytics, Core Platform Team, Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (Ottomata)
[14:20:18] Analytics, Core Platform Team, Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (Ottomata) > Our approach: Another idea. Analytics Engineering generates [[ https://dumps.wikimedia.org/other/analytics/ | other kinds of dumps ]] in Hadoop. You could too! - Use [[...
[14:22:04] ha dan, i just posted the same thing you emailed
[14:22:05] https://phabricator.wikimedia.org/T254275#6196493
[14:22:06] milimetric:
[14:31:50] (CR) Joal: "Naming and moving comments :)" (6 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/597541 (https://phabricator.wikimedia.org/T252857) (owner: Fdans)
[14:38:09] ottomata: yeah, this morning it feels like I learned about three projects that are all trying to duplicate what we're building
[14:38:19] hopefully we can help stop that from happening
[14:39:37] Analytics, Core Platform Team, Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (Milimetric) Having all the content in HDFS would be hugely useful, and doing it right would imply solving _so_ many problems. We should definitely talk about this.
[14:40:18] yeah
[14:40:37] this tech / product divide is kinda a problem
[14:41:49] this could also mean that despite having an interesting infra in place, people don't know about us :)
[14:41:57] ottomata, milimetric --^
[14:42:23] joal: "people don't know about us" is starting to bother me
[14:42:24] it's mostly false about events, as ottomata is doing great work at evangelizing, but for the rest...
[14:42:33] milimetric: I hear that
[14:42:37] i think part of the problem is java world
[14:42:44] possible
[14:42:53] let's not even mention scala :)
[14:42:57] i think product teams/engineers ignore things they don't know
[14:43:01] i mean, but I don't think WMF is thinking as a cohesive whole, still very siloed
[14:43:23] it feels a bit like it's been getting worse over the last year somehow, just a feeling
[14:43:27] that may be, but if someone wants to build Druid or Kafka in PHP like... they're welcome to
[14:43:29] until then...
[14:43:32] maybe because product engineers are trying to do cool things?
[14:43:41] yeah, that's probably right
[14:44:30] whatever the case, I just *really* don't want yet another way to generate dumps. That's one of those "hold my breath until you stop" kind of situations
[14:45:15] milimetric: http://1.bp.blogspot.com/_8an0Zt4q_JQ/SrCoWbIxAXI/AAAAAAAAAUM/0YTSdS83b6U/s320/ast%C3%A9rix.gif :)
[14:46:06] Analytics, Core Platform Team, Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (ArielGlenn) Would the idea be to also pull the html out of hdfs to make it available to dump downloaders, @Ottomata ?
[14:46:26] Analytics, VPS-project-codesearch: Add analytics/* gerrit repos to code search - https://phabricator.wikimedia.org/T249318 (Milimetric) poke @Ladsgroup / @Legoktm: any guidance here before I start a refactor?
[14:47:27] Analytics, Core Platform Team, Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (Ottomata) Ya.
[14:48:51] joal: you have a second to discuss naming?
[14:49:43] sure fdans - we should also bring milimetric in my opinion
[14:50:21] omw
[14:53:46] (CR) Ottomata: "> Patch Set 1:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602463 (owner: Ottomata)
[14:54:14] joal: did you see this one? https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/601749
[14:54:29] nope ottomata
[14:54:39] will look after meeting
[14:54:46] heya teammm
[14:59:29] (CR) Nuria: [C: +2] Fix permalink logic [analytics/wikistats2] - https://gerrit.wikimedia.org/r/602497 (https://phabricator.wikimedia.org/T254076) (owner: Milimetric)
[14:59:37] (CR) Nuria: [C: +2] "Tested, looks good." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/602497 (https://phabricator.wikimedia.org/T254076) (owner: Milimetric)
[15:00:45] (Merged) jenkins-bot: Fix permalink logic [analytics/wikistats2] - https://gerrit.wikimedia.org/r/602497 (https://phabricator.wikimedia.org/T254076) (owner: Milimetric)
[15:01:05] (PS2) Ottomata: Add EvolveHiveTable CLI tool to manually evolve Hive tables from JSONSchemas [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230)
[15:02:27] (CR) Ottomata: Add EvolveHiveTable CLI tool to manually evolve Hive tables from JSONSchemas (4 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:04:46] (PS3) Ottomata: Add EvolveHiveTable CLI tool to manually evolve Hive tables from JSONSchemas [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230)
[15:09:06] (CR) Ottomata: "> Patch Set 3:" (10 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865 (owner: Ottomata)
[15:14:27] milimetric: o/
[15:14:36] hi luca
[15:15:00] is the css bug in turnilo a "no" this hits my eyes so badly or "yes" green light :D ?
[15:15:24] yes green light
[15:15:33] I left finding it as an exercise to the reader
[15:15:40] and I have a side betting pool that nobody sees it
[15:15:40] :)
[15:16:36] me for sure, my css skills are none :D
[15:16:54] fdans will surely be nerd sniped
[15:17:32] and I am starting to suspect that you made up a non-existent bug to force him to read the css source for hours
[15:17:53] (If so I completely support what you did :D)
[15:17:59] oh no it's visible without looking at the CSS
[15:18:16] sadly, I like your way better, should've thought of it
[15:19:24] I am so sad, I was already picturing Fran swearing in Spanish
[15:22:24] Analytics, MediaWiki-API: mostviewed generator not returning any results - https://phabricator.wikimedia.org/T254211 (Milimetric) Invalid→Open My guess is they're both fine and caching is to blame, but I'm not sure how to dig through the cache behind the API. Opening and putting on Radar. We sh...
[15:22:55] Analytics, Operations, netops: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (CDanis) Hopefully Arzhel understands this better than I do, but here's my rough understanding: * 'Normally' routers send netflow just on the end of a flow...
[15:25:17] ottomata: any blockers to just +2 https://gerrit.wikimedia.org/r/#/c/mediawiki/vagrant/+/602488/?
[15:28:02] Analytics: Mediawiki History dumps unique editors feature request - https://phabricator.wikimedia.org/T254234 (Milimetric) The second version with a user_id sorting is something we're still considering. But to be clear, the queryable version would come with a cluster powerful enough for you to do operations...
[15:31:33] joal: elukey: I think I know what's going on re: netflow, and I think it even explains other weird things we've seen
[15:31:58] great cdanis! Please tell me more :)
[15:32:14] tldr we're asking pmacct/nfacct to aggregate things this way
[15:32:51] our routers report info even for still-active flows every 10sec, but they remember the start timestamp, and nfacct aggregates based on that when reporting to kafka, instead of bucketing by the time the netflow report arrived at nfacct
[15:33:29] if Arzhel approves I am about to try a config file change for esams, and if it works, I think we'll see these GRE flows show their 'actual' bps/pps, instead of the once-every-several-hours-to-days reports we see now
[15:33:45] (CR) Ottomata: Refine - Make event transform functions smarter about choosing which possible column to use (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865 (owner: Ottomata)
[15:34:20] (PS4) Ottomata: Refine - Make event transform functions smarter about choosing which possible column to use [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865
[15:35:06] Analytics: Mediawiki History dumps unique editors feature request - https://phabricator.wikimedia.org/T254234 (marcmiquel) I totally understand. I would only ask you to consider releasing files (datasets or dumps of any kind) additional to the queryable version, because I need to use the data from the 300 la...
[15:35:09] cdanis: wouldn't using the `stamp_updated` of your report be exactly what you describe?
[15:35:30] joal: I think that will just give the same huge spike at the end of the flow, rather than the start?
[15:35:49] not sure if we have many reports for that same flow with different stamp_updateds
[15:35:57] cdanis: hm, wouldn't the news come along as reports are generated?
[15:36:14] joal: thanks for review of transform funcs, qq. aren't i respecting the explicit column order when COALESCing?
[15:36:20] the list of columns is declared in a seq
[15:36:30] it looks like it does, from the two lines you pasted, but it looks like it's still reporting sums along the way, not sure how that will get aggregated into druid
[15:37:34] cdanis: you'll have to be very careful indeed, as sums of already summed data will get big - I get how a change of conf might help --> only sending events relevant to the period, not from the past
[15:37:57] Analytics, Product-Analytics, MW-1.35-notes (1.35.0-wmf.35; 2020-06-02), Patch-For-Review: [Spike] Should EventLogging support DNT? - https://phabricator.wikimedia.org/T252438 (Milimetric) > Yeah it's not stated in the spec because I think it doesn't really want to define a 'why' for using it, bu...
[15:38:04] yeah, I'm worried that if we just used stamp_updated, we'd just wind up with ever-increasing amounts of this traffic reported in druid
[15:38:07] elukey: excuse me luca how dare you talk to me in silent day
[15:38:08] ottomata: You were respecting order yes, I'm asking for explicitation in doc is all (sorry for not clear)
[15:38:20] OHHh ok, doesn't it say the first one found will be used?
[15:38:29] ottomata: as we coalesce, order is (more) important, so let's mention it?
[15:38:30] anyway I am about to try the new config just in esams :)
[15:39:00] milimetric joal soooooo are we going to load to pageview_hourly like some total rockandrollers?
[15:40:22] joal the doc says 'the column used will be the first non null value in the input DataFrame records'
[15:40:23] ?
[15:41:40] ottomata: right - the column-names are just after that - I guess it makes sense - I think I was after something like 'from columns (ip, clientip) in that order' - but that's documented as code
[15:41:54] you can keep it this way ottomata :)
[15:42:12] the columns are listed right above that in docs?
[15:42:30] haha, no joal happy to make clearer just not sure what I am missing
[15:42:47] (CR) Joal: Add EvolveHiveTable CLI tool to manually evolve Hive tables from JSONSchemas (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:42:48] fdans: the one downside we saw is that if someone does a query across the two sets, many dimensions would be missing for old data and if they group by those they might be confused. But I feel like if someone saw those results they'd at least read the docs. So I'm still for it, not sure if joal is
[15:42:49] i list the columns, then say the first non null one will be used
[15:43:07] yes
[15:43:23] (CR) Ottomata: Add EvolveHiveTable CLI tool to manually evolve Hive tables from JSONSchemas (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602475 (https://phabricator.wikimedia.org/T238230) (owner: Ottomata)
[15:44:55] milimetric joal I agree with dan
[15:45:21] ottomata: maybe add "in the order defined above"? very much detail :)
[15:45:27] k
[15:45:41] thanks ottomata - sorry for being nitpicky
[15:46:19] also ottomata, I had forgotten how complicated the legacy user-agent stuff was - maaaaaaaan
[15:46:27] yeah
[15:46:41] joal: https://w.wiki/T7G :)
[15:47:45] Analytics-Kanban, Trash: --- DISCUSSED BELOW --- - https://phabricator.wikimedia.org/T114124 (Milimetric) Heh, this is really convenient and we hate too many columns. And believe it or not, in the 5 years we've used this, it's been great, no errors. I'm actually surprised by that too, since it's so hac...
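The column-order point from the review thread above (several candidate columns are listed, and the first non-null one wins, in the order listed) can be sketched in a few lines of Python. This is only an illustration of the COALESCE semantics being discussed, not the Refine code itself: the helper `first_non_null` and the sample records are hypothetical, and only the column names `ip` and `clientip` come from the conversation.

```python
from typing import Any, Mapping, Optional, Sequence


def first_non_null(record: Mapping[str, Any], columns: Sequence[str]) -> Optional[Any]:
    """Return the value of the first column in `columns` that is present and not None."""
    for name in columns:
        value = record.get(name)
        if value is not None:
            return value
    return None


# Order matters: "ip" is preferred over "clientip" when both are set.
CANDIDATE_COLUMNS = ("ip", "clientip")

# Hypothetical records, for illustration only.
records = [
    {"ip": "10.0.0.1", "clientip": "10.0.0.2"},  # both set: "ip" wins
    {"ip": None, "clientip": "10.0.0.2"},        # falls through to "clientip"
    {"clientip": None},                          # nothing usable
]

resolved = [first_non_null(r, CANDIDATE_COLUMNS) for r in records]
print(resolved)
```

Swapping the order of `CANDIDATE_COLUMNS` changes the result for the first record, which is why stating the order explicitly in the docs was requested.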
[15:48:13] fdans, milimetric: I agree loading pageview is the goal - now possibly we should make sure we communicate around before doing that - I suggest using pageview_historical for now, and take a stance at moving that into pageview when everything is backfilled, and with proper communication (this would also mean we could possibly load that into cassandra without the need of a historical endpoint,
[15:48:19] maybe?
[15:48:44] joal milimetric in wikistats that's a great use of annotations
[15:48:46] cdanis: while I see the graphs, I must say I need help for interpretation :)
[15:49:12] joal: so previously, that GRE traffic got reported as giant spikes every ~day or so
[15:49:39] fdans: yeah, I feel like nobody knows about those annotations but I'm pretty proud of how they work.
[15:49:41] see also https://w.wiki/T7C
[15:50:16] 8.0Tbyte in 5 minutes is physically impossible :)
[15:50:17] (PS5) Ottomata: Refine - Make event transform functions smarter about choosing which possible column to use [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865
[15:50:43] now, for the collector I just restarted with the new config, it's being reported and time-bucketed as the traffic arrives
[15:51:09] cdanis: I get :) Thanks a lot for explanations :)
[15:51:14] (CR) Ottomata: Refine - Make event transform functions smarter about choosing which possible column to use (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865 (owner: Ottomata)
[15:51:26] I'll make a few more notes on the task and roll out the config change everywhere
[15:51:36] thanks, this thing with the GRE flows had been bugging me for months :D
[15:51:45] cdanis: well, I assume this change would also remove what I call late-data, as data is aggregated/reported on the flow
[15:52:15] you mean where you see a record with a stamp_inserted far in the past? I think so
[15:52:31] That's also what I think cdanis - great - Thanks a lot :)
[15:53:18] also ottomata - how about my more general comment of using different kafka topics for moving events from 1 platform to the other?
[15:53:33] ottomata: that would prevent using the change you just devised :)
[15:54:17] Analytics, Pageviews-API: Track page views by page ID rather than title (handles moved pages) - https://phabricator.wikimedia.org/T159046 (Milimetric) p:Medium→High Elevating priority per discussion around T251777#6119752
[15:54:53] joal: responded in earlier comment
[15:55:00] ah sorry
[15:55:02] reading
[15:55:15] tldr, maybe we could change topics, but we still need to keep the same hive tables
[15:55:50] ottomata: true - so schema evolution - But no dedup-needed, as sources would not be joined
[15:56:21] joal: the problem is incremental rollout too
[15:56:34] some wikis will produce old style events, others will produce new ones
[15:56:42] if they go to different topics, those are different source datasets
[15:56:49] and refine can't be used to write multiple jobs to the same partition
[15:56:54] since it overwrites
[15:57:06] if we didn't need incremental rollout, we could've stuck with my last patch
[15:57:07] actually no
[15:57:13] even without incremental rollout
[15:57:18] ottomata: that would mean half of data added to the table by one way, one via the other - This would imply using append instead of overwrite, which is problematic in case of rerun
[15:57:33] if e.g. we switched SearchSatisfaction on all wikis all at once to eventgate
[15:57:44] there'd still be ONE hour for which we couldn't refine properly
[15:57:50] yeah
[15:58:29] ottomata: I actually think having 2 datasets for incremental rollout makes it easier
[15:58:41] Analytics: Better redirect handling for pageview API - https://phabricator.wikimedia.org/T121912 (Milimetric) p:Low→High We have to re-groom sometimes soon, there are too many high priority things, but this certainly deserves consideration. By the way, our infrastructure and code is very friendly to...
[15:58:48] the reason for this patch is that there will be hours for which on some records http.request_headers.user-agent is set and needs parsed, and for others, that field is not set and userAgent is already parsed by eventlogging-processor
[15:58:55] after that, we can tweak refine to union 2 datasets for the period, and with your patch, it should work
[15:59:16] ottomata: the other solution is to allow refine to append, and closely monitor/check when rerunning
[15:59:24] joal that means we need to do manual work for every EL schema
[15:59:41] ottomata: only in case of rerun
[15:59:55] ottomata: In any case I think the migration will be operationally heavy
[16:00:06] hm, as is it isn't too bad
[16:00:10] right now it is:
[16:00:15] pre-evolve hive table
[16:00:26] move schema to eventgate via mw-config deploys
[16:00:45] once all wikis are migrated, use new refine job with MEP schema for that schema
[16:01:12] if we have multiple datasets, we have to run special manual refine jobs for each hour during the migration
[16:01:14] ottomata: still a lot to be done for every schema
[16:01:26] and a lot of manual refinement
[16:01:30] no
[16:01:31] no manual refinement
[16:01:42] that's why the patch I made
[16:01:45] 18:01:12 < ottomata> if we have multiple datasets, we have to run special manual refine jobs for each hour during the migration
[16:01:52] so the same refine for meta schemas can continue to be used until the migration is complete
[16:01:55] yes
[16:02:00] IF we have multiple datasets (your suggestion)
[16:02:02] i mean
[16:02:06] right now we have only one
[16:02:09] per schema
[16:02:17] e.g. eventlogging_SearchSatisfaction
[16:02:23] so one refine job per hour per schema, just as it is now
[16:02:30] I hear that - and events with different schemas will be mixed to the same kafka topic, is that right?
[16:02:31] so no changes are needed to refine process during migration
[16:02:37] they are compatible schemas
[16:03:07] ok I think I get it ottomata
[16:03:08] the new MEP schema can be considered a new version of the metawiki EL schema
[16:03:12] yup
[16:04:11] ok cool - I guess the only downside is then that some interesting conventions from new-streams are still not followed by the newly migrated-old-ones
[16:04:16] ottomata: --^
[16:04:43] joal like what?
[16:04:52] like datacenter
[16:04:55] ah yes
[16:04:59] (CR) Joal: [C: +1] "Thanks for the nitpicky changes :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/601865 (owner: Ottomata)
[16:05:05] Analytics, Core Platform Team Workboards (Initiatives): Design Document that proposes an alternative architecture for historic data endpoints - https://phabricator.wikimedia.org/T241184 (Naike) @WDoranWMF should this task be in the To Do column as there's no one assigned to it?
[16:05:06] we can consider doing that migration later
[16:05:08] if we want to
[16:05:14] i separated it out from this one to keep things simpler
[16:05:42] indeed I was thinking of that - Seems perfect ottomata :)
[16:05:45] this migration is about using eventgate and MEP schema. another one can be about prefixing legacy EL streams and adding partition
[16:05:47] thanks for explanations!
[16:05:54] ya thanks for reviews and asking! :)
[16:07:46] milimetric, fdans: is it my suggestion keeping you speechless like that?
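The overwrite-versus-append trade-off argued over in the Refine thread above can be shown with a toy in-memory "table". This is a minimal sketch under stated assumptions, not how Refine or Hive actually write data: `write_overwrite`, `write_append`, the partition key and the row contents are all hypothetical names invented for the illustration.

```python
from collections import defaultdict
from typing import Dict, List

Table = Dict[str, List[dict]]


def write_overwrite(table: Table, partition: str, rows: List[dict]) -> None:
    """Replace whatever the partition held with `rows`."""
    table[partition] = list(rows)


def write_append(table: Table, partition: str, rows: List[dict]) -> None:
    """Add `rows` to the partition, keeping what was already there."""
    table[partition].extend(rows)


# Hypothetical rows from the two source datasets for the same hour.
legacy_rows = [{"source": "legacy", "id": 1}]
new_rows = [{"source": "eventgate", "id": 2}]
hour = "2020-06-05T15"

# Two jobs overwriting the same partition: the second wipes the first.
overwrite_table: Table = defaultdict(list)
write_overwrite(overwrite_table, hour, legacy_rows)
write_overwrite(overwrite_table, hour, new_rows)

# Two jobs appending: both land, but a rerun duplicates rows.
append_table: Table = defaultdict(list)
write_append(append_table, hour, legacy_rows)
write_append(append_table, hour, new_rows)
write_append(append_table, hour, new_rows)  # rerun of the second job

# One job over a single (mixed) source with overwrite: reruns are harmless.
single_table: Table = defaultdict(list)
for _ in range(2):  # initial run, then a rerun
    write_overwrite(single_table, hour, legacy_rows + new_rows)

print(len(overwrite_table[hour]), len(append_table[hour]), len(single_table[hour]))
```

The last case corresponds to the option the thread settles on: one dataset per schema, so one overwrite-style job per hour stays safe to rerun.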
[16:09:20] joal: wooo sorry, no I think that's good
[16:09:30] joal: no, sorry, it's fine, if Fran wants to communicate up front and do it all together he can do that, otherwise doing it in two steps seems fine too
[16:09:30] :-P
[16:09:32] Analytics, Operations, netops, Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (CDanis) Open→Resolved The change looks effective: previously, GRE traffic was predominantly reported as ginormous-to-the-poin...
[16:09:34] Analytics: Investigate why netflow hive_to_druid job is so slow - https://phabricator.wikimedia.org/T254383 (CDanis)
[16:09:59] told you milimetric and fdans - ranting Friday for me ;)
[16:10:17] Thanks for answers - Let's keep stuff separated, easier to mess it :)
[16:10:20] psh, you'll have to step up your rants man, have you seen Fran's? High bar
[16:10:34] lol
[16:10:54] joal: milimetric ok for me to keep them separated for now
[16:10:55] I had never even considered competing
[16:11:06] :)
[16:11:15] wmf.pageview_historical then
[16:11:20] k
[16:11:21] great
[16:12:44] nuria: hi! I just gave you permission for the BH data task... I think it was made private by mistake, just gonna check before fully removing restrictions
[16:12:52] (CR) Joal: "> Yeah I thought about that too, but I think we'd need a parameter and config to enable or disable that, and I don't want to spend the tim" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/602463 (owner: Ottomata)
[16:13:53] Analytics, Operations, netops, Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (CDanis) As a last note: this should improve the accuracy of //all// long-lived flows (any with a duration longer than a minute); the...
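The netflow effect described earlier in the log (aggregating periodic reports by the flow's start timestamp rather than by the time each report arrived) can be reproduced with a small simulation. This is a rough sketch, not pmacct/nfacct behaviour or configuration: the `Report` type, the one-minute bucket size, and especially the idea that each report carries the bytes seen since the previous report are simplifying assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Report(NamedTuple):
    flow_start: int   # when the flow began, in seconds
    arrived_at: int   # when this report reached the collector, in seconds
    bytes_delta: int  # bytes since the previous report (assumed, see lead-in)


BUCKET_SECONDS = 60  # hypothetical aggregation window


def bucket(ts: int) -> int:
    """Round a timestamp down to the start of its bucket."""
    return ts - ts % BUCKET_SECONDS


def aggregate(reports: Iterable[Report], key: str) -> Dict[int, int]:
    """Sum bytes per time bucket, bucketing on the given Report field."""
    totals: Counter = Counter()
    for report in reports:
        totals[bucket(getattr(report, key))] += report.bytes_delta
    return dict(totals)


# One long-lived flow that started at t=0 and reports 1000 bytes every 10 seconds.
reports = [Report(flow_start=0, arrived_at=t, bytes_delta=1000) for t in range(10, 190, 10)]

by_start = aggregate(reports, "flow_start")    # everything piles into the first bucket
by_arrival = aggregate(reports, "arrived_at")  # traffic is spread over the flow's lifetime

print(by_start)
print(by_arrival)
```

Bucketing on `flow_start` attributes the whole flow to the moment it began, which matches the "giant spike" and late-data symptoms in the log; bucketing on `arrived_at` reports traffic as it is observed.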
[16:16:16] 10Analytics, 10Operations, 10netops, 10Patch-For-Review: Ingestion semantic for netflow data sent to kafka generates late-data - https://phabricator.wikimedia.org/T254574 (10JAllemandou) Great explanation @CDanis - This change should also resolve hour late-data issue - One stone two birds :) [16:16:55] cdanis: nhs [16:16:58] uff sorry [16:17:05] I wanted to say: just read the task, wow [16:17:07] nice work [16:17:10] thanks! [16:17:26] so IIUC we shouldn't change anything on our settings (our == analytics) [16:17:28] the GRE thing had been bugging me for months [16:17:30] yeah, I think it is fine now [16:17:33] super [16:17:49] if you had used the stamp_updated instead, we'd also get incorrect data -- see the example in the task description [16:17:53] the counters are cumulative [16:18:09] the two records there are different minutes of the same flow [16:18:28] elukey: Those network engineers have something we don't - you try to help them, and finally it's always them helping you :) [16:18:42] hey, I am not a network engineer, I just pretend to be one ;) [16:18:47] * joal realizes this comment is also very valid for ops in general [16:19:44] cdanis: to thank you I'll update turnilo on monday to the latest version [16:19:50] :D [16:21:24] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 2 others: Eventlogging Client Side can use the stream config module to dynamically adjust sampling rates - https://phabricator.wikimedia.org/T234594 (10mpopov) a:03mpopov [16:22:01] also elukey, something interesting - netflow-hourly has not run for hours 0 to 8 today [16:22:25] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 2 others: Eventlogging Client Side can use the stream config module to dynamically adjust sampling rates - https://phabricator.wikimedia.org/T234594 (10mpopov) [16:22:28] elukey: I assume it's because the job got stuck and therefore timers didn't launch [16:25:08] 10Analytics,
10Event-Platform, 10Growth-Team, 10MediaWiki-Recent-changes, and 2 others: Remove deprecated RCFeedEngine support - https://phabricator.wikimedia.org/T250628 (10Milimetric) I've stayed mostly out of this part of Mediawiki. It seems to me we should do this kind of thing via kafka as opposed to... [16:26:09] elukey: forgot to ask, did we reinstate RU to run? [16:26:09] joal: in theory they should have all run, but probably the job stuck was causing the timer to end prematurely [16:26:17] nuria: yep yep [16:29:09] elukey: mmm.. something is missing cause the new reports i set up for reportupdater are not running, does mforns know how the report updater source gets updated? [16:31:54] nuria: did we set up the systemd timers for them? [16:32:07] I mean, were they already working before etc.. [16:32:10] elukey: duh -> no [16:32:25] ah! then we can easily add them [16:32:31] elukey: i will look [16:34:13] nuria: should be profile::reportupdater::jobs [16:35:49] 10Analytics, 10Analytics-Kanban: Spike, see how easy/hard is to scoop all tables from Eventlogging log database - https://phabricator.wikimedia.org/T250709 (10Nuria) We still need to spot check all tables and make sure data is queryable, but yeah, almost there [16:44:12] (03CR) 10Elukey: [V: 03+2 C: 03+2] Upgrade to upstream version 1.24.0 [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/602604 (https://phabricator.wikimedia.org/T253294) (owner: 10Elukey) [16:45:23] !log upgrade turnilo to 1.24.0 [16:45:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:45:51] cdanis: turnilo upgraded :) [16:46:54] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Upgrade turnilo to latest upstream - https://phabricator.wikimedia.org/T253294 (10elukey) Just deployed the new version!
[16:47:20] 10Analytics, 10Analytics-Kanban: Test superset running on gunicorn + gevent - https://phabricator.wikimedia.org/T253545 (10elukey) [16:47:23] 10Analytics, 10Performance-Team: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 (10Nuria) [16:49:07] 10Analytics, 10Core Platform Team Workboards (Initiatives): Design Document that proposes an alternative architecture for historic data endpoints - https://phabricator.wikimedia.org/T241184 (10Nuria) a:03Milimetric [16:50:57] elukey: nice :) [16:51:34] 10Analytics, 10Core Platform Team Workboards (Initiatives): Design Document that proposes an alternative architecture for historic data endpoints - https://phabricator.wikimedia.org/T241184 (10Nuria) My apologies cause I thought this task had been updated a while back, here is the document: https://docs.google... [16:53:50] 10Analytics, 10Analytics-EventLogging, 10Better Use Of Data, 10Event-Platform, and 2 others: Eventlogging Client Side can use the stream config module to dynamically adjust sampling rates - https://phabricator.wikimedia.org/T234594 (10mpopov) For testing in MediaWiki Vagrant, I had the following to LocalS... [16:53:55] PROBLEM - piwik.wikimedia.org HTTPS on matomo1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster [16:54:26] this is me --^ [16:54:30] new host [17:05:40] (piwik on buster, new vm) [17:10:31] RECOVERY - piwik.wikimedia.org HTTPS on matomo1002 is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster [17:24:59] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Product-Analytics: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (10mpopov) @Ottomata @MarcoAurelio: I can't access that logstash site so I have no idea which o...
[17:37:52] nuria: I thought the puppet job to trigger the new reports was already there [17:38:13] mforns: i thought so too for structured data [17:38:23] mforns: cause i just added a new query [17:38:40] nuria: I can see the puppet trigger for structured data [17:39:00] mforns: right, so no other should be needed for the new query right? [17:40:29] did you make the script executable? [17:40:38] no, no other should be needed [17:41:56] yes, there are no exec permissions [17:42:09] that's why it's not running [17:42:50] ah sorry I thought it was a new RU job, my bad! [17:46:48] (03PS1) 10Mforns: Add exec permits to wikidata_usage_in_wikimedia_projects [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/602748 (https://phabricator.wikimedia.org/T247099) [17:47:34] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Self-merging to unbreak production" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/602748 (https://phabricator.wikimedia.org/T247099) (owner: 10Mforns) [17:47:54] nuria: this should fix it ^ [17:47:58] will check in a bit [17:54:20] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (10elukey) [17:54:46] ottomata: o/ - on monday I'd need to roll restart all the kafka brokers to pick up the new openjdk (via cookbook) [17:54:58] is it ok or is something in progress with the new nodes? [17:55:11] (just double checking before doing something harmful) [17:56:23] !log roll restart presto server on an-presto* to pick up new openjdk upgrades [17:56:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:57:46] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Product-Analytics: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (10Ottomata) `lang=json { "_index": "logstash-2020.06.05", "_type": "eventlogging", "_id"...
[17:59:33] elukey: ya is fine! [18:00:19] ack! [18:03:32] 10Analytics, 10Operations, 10serviceops, 10vm-requests: Create a VM for matomo1002 (eqiad) - https://phabricator.wikimedia.org/T252742 (10elukey) 05Open→03Resolved a:03elukey [18:03:33] 10Analytics, 10Analytics-Kanban: Move Matomo to Debian Buster - https://phabricator.wikimedia.org/T252740 (10elukey) [18:09:21] 10Analytics, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Nuria) >Would the idea be to also pull the html out of hdfs to make it available to dump downloaders This already happens for the many pageview and edit dumps we release, in an hourly... [18:16:07] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Product-Analytics: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (10mpopov) @Dbrant: The app version doesn't look right to me, does it look right to you? This... [18:26:22] * elukey off! [18:26:23] o/ [20:07:49] 10Analytics, 10Analytics-Kanban, 10Core Platform Team Workboards (Initiatives): Design Document that proposes an alternative architecture for historic data endpoints - https://phabricator.wikimedia.org/T241184 (10Milimetric) We basically split this effort into two phases. The design for the first, designing... [20:23:55] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Product-Analytics, 10Wikipedia-Android-App-Backlog: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (10Dbrant) @mpopov You are correct -- this is likely some e... [21:03:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Upgrade turnilo to latest upstream - https://phabricator.wikimedia.org/T253294 (10JKatzWMF) @elukey I seem to have lost the ability to un-pin items from showing up in the graph. 
It's hard to describe, but using the colored squared seen in this screenshot,... [21:07:23] 10Analytics, 10Analytics-EventLogging, 10Beta-Cluster-Infrastructure, 10Product-Analytics, 10Wikipedia-Android-App-Backlog: MobileWikiAppProtectedEditAttempt: 'protectionStatus' is a required property - https://phabricator.wikimedia.org/T254567 (10mpopov) Thanks, Dmitry! [21:46:04] 10Analytics, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Legoktm) >>! In T254275#6190666, @ArielGlenn wrote: > I have a number of questions folks may want to think about, after having read the document on officewiki. Can this document be sh... [21:52:38] 10Analytics, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Joe) >>! In T254275#6196545, @Milimetric wrote: > Having all the content in HDFS would be hugely useful, and doing it right would imply solving _so_ many problems. We should definitel... [22:21:33] 10Analytics, 10Core Platform Team, 10Dumps-Generation: HTML Dumps - June/2020 - https://phabricator.wikimedia.org/T254275 (10Nuria) >Pull titles from https://dumps.wikimedia.org/other/pagetitles/, record to our database >Use Parsoid API https://en.wikipedia.org/api/rest_v1/ to record TIDs and pull HTML files...