[00:49:05] 10Analytics: Turnilo : Issue with Is Deleted and Is Reverted dimensions when added as Split - https://phabricator.wikimedia.org/T230853 (10Mayakp.wiki) [06:33:34] 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Yair_rand) Several examples of ways to find out a user's country: (Note: A lot of this depends on how frequently this report will... [07:28:53] Good morning elukey - Do you have a minute for a netflow segment discussion? [07:30:12] morning! [07:30:13] 10m is ok? [07:30:20] anytime :) [07:42:11] joal: all right I am good [07:42:25] what is the problem with the segments? [07:43:17] (I am checking the coordinator's UI) [07:43:46] elukey: when looking at segments in druid coordinator UI, you can see that for some hours, there is 1 segment, for others multiple ones (2 or 3) [07:44:21] elukey: interestingly, the segment naming is different when there is 1 segment or 2/3 [07:45:15] elukey: my guess is that using 6h for a task was too long in relation to the batch reindexation (my bad, I should have thought about it) [07:45:24] ah ok now I understand how to read that UI :D [07:45:25] finally [07:45:55] elukey: I think there sometimes is a clash between batch reindexed segments and realtime created one, which we don't want [07:46:06] for example, 2019-08-20T14 has 3 segments [07:46:13] correct [07:46:40] what do you suggest? Reduce to PT1H or similar? [07:47:09] I think it's best solution yes (sorry :( [07:47:44] it is an experimental datasource, we are experimenting :) [07:48:11] do you think that we should stop the supervisor, clean up the segments, restart everything? [07:48:42] elukey: I also think we should batch-reindex 2019-08-20T[14|15|21] and possibly some hours today (not sure yet) [07:48:53] kill supervisor indeed, and restart [07:49:10] I don't think there is value in cleaning segments as batch should do it [07:58:18] (sorry got lost in reading stuff) [07:58:44] ack, so PT1H joal ? [07:59:01] Yessir :) [07:59:04] Thanks for that [08:00:42] ok supervisor restarted [08:01:29] \o/ [08:02:11] about the batch reindex - I can see that the daily job does a window from -4 to -3 days ago [08:02:12] elukey: looking at UI, you can see segments of finished hours have been handed-off to deep-storage [08:02:45] elukey: about rerunning, we need to force as the jobs have been successful [08:02:50] what if we anticipate and do -4 to -1 or similar? It should replace segments with only a daily one right? [08:03:01] correct [08:06:32] so batch reindex from say 2019-08-17T00:00:00 to 2019-08-21T00:00:00 to be sure [08:06:44] also elukey, do you have an idea of where I could put my hands on the spark-examples.jar for 2.3 ? I looked around but didn't find it :( [08:07:03] If needed I'll download the full spark archive [08:07:21] if those are on the "regular" worker node we can copy it to the testing cluster [08:08:04] elukey: not seen there [08:08:31] ah there yo ugo [08:08:32] elukey@an-worker1080:~$ dpkg -S spark-examples.jar [08:08:32] spark-core: /usr/lib/spark/lib/spark-examples.jar [08:08:36] this is the file right? 
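A minimal sketch of the taskDuration change discussed above, assuming the wmf_netflow supervisor spec is kept in a local JSON file and that the Druid overlord answers on druid1001.eqiad.wmnet:8090 (host name, file name and the exact JSON path are assumptions); re-posting a spec for the same datasource replaces the running supervisor:

    # shrink the realtime task duration so segments line up with the hourly batch reindexation
    jq '.ioConfig.taskDuration = "PT1H"' netflow_supervisor.json > netflow_supervisor_pt1h.json
    # re-submit the spec; the overlord supersedes the old supervisor and starts new tasks
    curl -s -X POST -H 'Content-Type: application/json' \
         -d @netflow_supervisor_pt1h.json \
         'http://druid1001.eqiad.wmnet:8090/druid/indexer/v1/supervisor'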
[08:08:43] I think it is :) [08:08:52] so the spark 1 packages should be cleaned up [08:09:01] spark-core in there is not right, we should remove it [08:09:01] oh no - /spark is the old one [08:09:10] 41984 [08:09:55] one thing that I didn't get - you were used to play with a spark2 examples jar that is not present anymore, or you'd like to find one for spark2.3 ? [08:10:12] I don't think I used to play with examples [08:10:54] ahhh okok [08:11:49] so I actually never looked for it :S [08:12:10] elukey: I'm gonna download the spark archive fro now [08:12:14] super [08:12:28] we could keep this in mind and ask to Andrew to add the examples to our package [08:12:32] it should be really easy [08:12:51] I think so [08:32:48] joal: currently working on the new zookeeper nodes [08:32:54] they are racked now :) [08:33:03] You man are the man :) [08:33:17] I'll install buster directly on them [08:33:29] the zookeeper version is still in the 3.4 series [08:33:53] yeah, that sounds like a good idea, the differences within the 3.4 branch are marginal [08:34:04] I knew you would have read :D [08:34:10] Hi moritzm :) [08:34:15] morningggg [08:34:42] moritzm: it feels you have an alarm for lines belonging onto your realm :) [08:39:11] I just coincidentally saw it pass by :-) [08:52:17] elukey: https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Hadoop_testing_cluster#Tests_by_Joal [08:52:49] joal: wow!!! [08:53:20] <3 [08:53:54] * joal does the happy dance :) [08:57:32] elukey: is conf-lvm.cfg really the best option for the partman recipe? these are Zookeeper-only, right? the partman recipe we use for conf* allocates 30G to /var/lib/etcd [08:58:11] moritzm: yeah it is probably not the best one, for the moment I wanted to make sure that a PXE boot worked :) [08:58:26] ok :-) [08:58:32] thanks for the review! [08:58:37] yw [08:58:42] elukey: I'm onto booking the place for the Berlin conference - Confirmation? [08:58:55] yep [08:59:02] Good :) [08:59:26] we can probably loop in mr ema in the discussion [09:00:02] Great idea! Dear M. ema, when you'll have a minute :) [09:05:37] elukey: I need a hand please - Con you sync my ldap account onto analytics1039-hue please? [09:05:58] ah yes [09:07:13] joal: should be ok! [09:07:37] hey teamm [09:07:39] :] [09:07:49] Good morning mforns :) [09:08:02] \o/ elukey [09:08:20] yesterday I got the mediawiki oozie job working :] [09:08:36] now, I'm going to launch a final test with all data [09:09:13] just fyi, because the (heavy) job will run for a couple (3-4) hours starting now [09:11:15] no prob mforns [09:11:22] mforns: how have you managed to run the thing? [09:11:37] bumping the ooie limit of workflow-number size? [09:11:38] ppffff, it was not the loop tree depth... [09:11:41] no [09:11:43] really? [09:11:48] ohhhh [09:12:04] * joal stands in suspens [09:12:06] although, having troubleshot that was good, because eventually we'll need to bump that limit up [09:12:13] the problem wasssssss [09:12:16] me [09:12:33] The famous "me-problem" [09:12:47] I kinda know it well [09:12:52] I forgot to check whether the partition source directory for the archive job existed [09:13:10] Right [09:13:15] and some wikis do not start at 2001, so... [09:13:28] the loop was iterating over 2001, 2002, ... 
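A sketch of the spark-examples workaround mentioned above (pulling the upstream archive), assuming Spark 2.3.1 matches the installed spark2 packages and that spark2-submit is the submit command on the client; the download URL and version are assumptions to adjust:

    # the upstream binary distribution bundles the examples jar under examples/jars/
    wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
    tar xzf spark-2.3.1-bin-hadoop2.7.tgz
    # smoke-test the cluster with the classic SparkPi example on YARN
    spark2-submit --master yarn --class org.apache.spark.examples.SparkPi \
        spark-2.3.1-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.1.jar 100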
[09:13:33] but some of them were not present [09:13:50] Bizarre that the error was so unclear [09:13:59] I'm not so sad, because oozie communicated that error really confusingly, [09:14:11] yea, no logs anywhere [09:14:22] hm [09:15:07] and HUE was telling me that the failed action was the script that worked perfectly when called by hand with same exact parameters [09:15:20] yes I recall that [09:15:35] I think the loop drives oozie mad in term of error reporting [09:15:46] I think, because when the later iteration failed, all actions that were being executed in parallel, would be marked as failed as well, without logs [09:15:56] yop [09:16:00] yea, agree, specifically the parallel loop [09:16:29] anywayyyyyyy [09:16:32] works now [09:16:49] well done buddy [09:19:21] yep +1 :) [09:27:04] * elukey be back in a bit [09:29:41] job launched :S [09:51:52] elukey: let me know when you're back :) [10:17:56] 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Erik_Zachte) @Yair_rand you examples show ingenuity, yet they also seem somewhat contrived. Suppose some malicious geeky and rathe... [10:23:10] 10Analytics, 10Reading Depth: Publish aggregated reading time dataset - https://phabricator.wikimedia.org/T230642 (10Groceryheist) Thanks Nuria! I think that both measures might be useful for different purposes. It might also be interesting to compare the two measures for example to see if people tend to op... [10:23:51] joal: here I am sorry, multiple people ringing at the doorbell :) [10:24:04] no problem elukey :) [10:24:40] elukey: various questions :) [10:25:02] elukey: have you managed / try to rerun jobs using hue on the kerb cluster? [10:25:24] elukey: subsidiary - Am I the only one not to see rerun buttons? [10:26:05] elukey: something else: can you confirm you receive alert emails from oozie (your address is the one in use of both SLA and error) [10:26:42] so yes I can re-run the jobs, not sure why you can't, maybe I need to make sure ad admin lemme check [10:26:48] emails are coming yes [10:27:01] elukey: Finally - Do you think I can run a temporary test sending data to graphite (test metric-name of course), or shall I find another way to test spark actions? [10:27:12] sorry for the questions spam! [10:27:18] nono please [10:27:26] so if we could avoid graphite it would be great [10:27:37] but if it is a quick one we can do it [10:27:57] There is a way to avoid :) [10:28:13] I'll do it without graphite [10:28:19] you are now a superuser in hue, can you check? [10:28:54] \o/ [10:29:00] all right good :) [10:29:12] ok great - I had played with manually restarting actions, but not so nice :) [10:30:03] last: there seem to be missing data for 2019-08-16 hour 13 (empty folder in raw) - I assume it was a test [10:30:40] for what datasource? [10:31:03] elukey: Before I continue I confirm with you - I plan on starting the pageview coord to test shell-action (used for archive), and the mobile_app_metrics for spark action [10:31:07] elukey: web [10:31:10] request [10:31:28] elukey: in kerb cluster as well [10:31:41] no idea about the raw empty folder :( [10:31:53] yes please do anything that you want with the cluster [10:33:35] WOW - interesting pattern! 
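The confusing oozie failure mforns describes above turned out to be missing source partitions (wikis with no 2001 or 2002 data), so a pre-flight check along these lines would surface the gaps before launching the loop; the base path, snapshot and wiki list are made-up placeholders:

    BASE=/wmf/data/wmf/mediawiki/history/snapshot=2019-07   # hypothetical source location
    for wiki in enwiki ptwiki cawiki; do
      for year in $(seq 2001 2019); do
        # 'hdfs dfs -test -d' exits non-zero when the partition directory is absent
        hdfs dfs -test -d "${BASE}/wiki_db=${wiki}/year=${year}" \
          || echo "missing partition: wiki_db=${wiki} year=${year}"
      done
    done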
[10:35:06] elukey: I reran manually 2 oozie actions: sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL --rerun 0000492-190729133359585-oozie-oozi-C --action 36[23] [10:35:33] ah you can re-run only actions! [10:35:34] nice [10:35:35] They started, then failed at refine stage, leaving an empty raw-data hour folder !!!! both [10:35:44] Not cool [10:36:25] So the issue is not about data not having been imported, but about data having been deleted!!! [10:36:59] I am not understanding sorry [10:37:01] I have started reruns of action using hue, let's see if the same pattern occurs [10:37:16] is it a problem for the test cluster or oozie in general? [10:37:21] elukey: test clustewr [10:37:37] or at least I have not seen that in the regular cluster [10:38:20] interesting - it looks like rerun through hue doesn't cause the problem [10:43:53] joal: what is the failure at refine stage that you talked about? [10:43:57] something related to auth? [10:44:12] I don't think [10:44:24] can you point me to it ? [10:44:38] elukey: http://localhost:8080/oozie/list_oozie_workflow/0000144-190820124324114-oozie-oozi-W/?coordinator_job_id=0000492-190729133359585-oozie-oozi-C&bundle_job_id=0000491-190729133359585-oozie-oozi-B [10:45:50] elukey: from the logs, hive failed to read its mapreduce plan from files - but the weird thing is that data has been deleted on the way! [10:46:04] raw data? [10:46:08] yes [10:46:21] sudo -u analytics kerberos-run-command analytics hdfs dfs -du -s -h /wmf/data/raw/webrequest/webrequest_test_text/hourly/2019/08/16/* [10:47:52] strange [10:48:27] elukey: I think it's probably related to me not correctly restarting actions usong command-line, but eh, if we drop raw data because of that, it's bad ! [10:48:47] it definitely is :( [11:02:47] elukey: ok - think I have understood why data got deleted [11:13:28] what happened? [11:15:01] I didn't use the "--nocleanup" flag, and the raw data folder is written as an output of the dataset (since the partitioned flag is written onto it) [11:15:28] ah ok so no blocker for kerb right? [11:15:48] When reran without "--nocleanup", oozie job deletes the output-folders defined in dataset [11:15:54] nope, unrelated [11:16:05] But still we probably could do with a correction :) [11:16:26] Will provide a patch [11:16:48] super [11:21:27] (03PS1) 10Joal: Update webrequest load bundle dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 [11:23:57] elukey: It looks like hdfs doesn't have a kerb cred - is that true? [11:25:16] joal: on what host? [11:25:28] client (an-tool1006) [11:26:55] ahhh they keytab! [11:27:12] so yes the hdfs keytabs are now ony deployed where the daemons run [11:27:18] only analytics is deployed to clients [11:27:28] elukey: I can actually do without, but at some point it might come necessary :) [11:27:55] Or, that means we need to connect to master to sudo as hdfs [11:28:00] Probably better this way :) [11:28:38] joal: exactly this was my idea, but let me know if it is cumbersome or not [11:28:45] super fine :) [11:28:49] my idea was to reduce the hdfs usage to admin actions [11:28:49] It's actually better [11:28:55] and use analytics for all the job-related stuff [11:29:19] elukey: to test, I need to cheat somehow, so hdfs user is handy, but no big deal :) [11:29:41] ahahha yes yes [11:29:55] if you need it all the workers should have it, including masters [11:30:02] ack! 
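The raw-data deletion above came from rerunning coordinator actions on the command line without the no-cleanup flag, so oozie removed the output directories declared in the dataset (the raw folder counts as output because the partitioned flag is written into it). The same rerun with the flag, reusing the ids from the log:

    sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL \
        --rerun 0000492-190729133359585-oozie-oozi-C \
        --action 362,363 --nocleanup   # keep existing output dirs instead of deleting them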
[11:30:02] available with the same kerberos-run-command [11:30:33] joal: how is it going so far dealing with auth etc..? [11:30:38] super cumbersome or bearable? [11:30:51] bearable - I'm getting used to kerberos-... [11:31:02] that can be changed to something handier of course [11:31:13] And a single kinit at the beginning of session is not difficult [11:31:15] if we come up with a better/more-flexible solution [11:33:05] elukey: if we find a command-name that can autocomplete after 1 or 2 chars instead of 4, it's a win :) [11:33:29] ack makes sense [11:33:32] elukey: krbr-run [11:33:41] or even krb-run [11:39:28] yup - looks like 'kr' is not yet a command header in the client [11:39:54] elukey: I could go with 'krb-run', an [11:40:14] and use the current user as principal instead of having to pass it (if feasible) [11:41:02] adding an extra param to pass a different principal: sudo -u hdfs krb-run --pcpl analytics hdfs dfs -ls [11:41:17] or sudo -u analytics krb-run hdfs dfs -lsn [11:42:12] ack [11:42:37] But we're on details here - everything is working asesome so far :) [11:43:31] ok taking a break - see you in a bit :) [11:46:48] o/ [11:55:35] (03CR) 10Elukey: [C: 03+1] Update webrequest load bundle dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 (owner: 10Joal) [11:55:38] 10Analytics, 10Analytics-Data-Quality: Actor table copy in the Data Lake contains mangled usernames - https://phabricator.wikimedia.org/T230915 (10Neil_P._Quinn_WMF) [12:01:17] 10Analytics, 10Analytics-Data-Quality: Actor table copy in the Data Lake contains mangled usernames - https://phabricator.wikimedia.org/T230915 (10Neil_P._Quinn_WMF) [12:01:48] 10Analytics, 10Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (10Neil_P._Quinn_WMF) [12:04:03] 10Analytics-Kanban, 10Product-Analytics: Address data quality issues in the mediawiki_history dataset - https://phabricator.wikimedia.org/T204953 (10Neil_P._Quinn_WMF) [12:04:05] 10Analytics, 10Analytics-Data-Quality: Import of MediaWiki tables into the Data Lakes mangles usernames - https://phabricator.wikimedia.org/T230915 (10Neil_P._Quinn_WMF) [12:13:09] * elukey lunch! [12:24:25] Hi! I have no idea if this is the place to ask but if I have a request id from logstash could the request parameters be extracted from hadoop for us? [13:14:24] tarrow: hi! we can help but we'd need more details [13:25:50] elukey: great! basically we want to know the likely interface language of that request + the user defined languages [13:26:32] I guess this is us sort of showing how our logging wasn't as comprehensive as we thought [13:27:01] and now we're trying to reverse engineer the reasons a certain request failed [13:27:12] tarrow: we'd need to figure out what dataset we should query, and if we have such dataset :) [13:27:21] so can you describe what you'd need? [13:28:17] as reference https://wikitech.wikimedia.org/wiki/Analytics#Datasets [13:28:36] thanks! I'll look and try to figure out my question :P [13:29:08] hola team EUROPE [13:31:21] o/ [13:35:05] I think we'd want as much data as we can to try and determine the possible languages that request might have tried to get data in. So we'd probably want from the webrequest datalake the URI, and accept_language header, the pageview_info and the x_analytics map [13:35:54] super, do we have a timeframe and a request patter? [13:35:58] *pattern? 
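A hypothetical sketch of the shorter krb-run wrapper discussed above, defaulting the principal to the invoking user with an optional --pcpl override; the keytab path and principal naming are assumptions, and this is not the real kerberos-run-command implementation:

    #!/bin/bash
    # usage: krb-run [--pcpl PRINCIPAL] COMMAND [ARGS...]
    # e.g.:  sudo -u analytics krb-run hdfs dfs -ls /wmf/data
    set -e
    principal="$(id -un)"                 # default to the user running the command
    if [ "$1" = "--pcpl" ]; then
        principal="$2"
        shift 2
    fi
    # get a ticket from the per-service keytab, then run the wrapped command
    kinit -k -t "/etc/security/keytabs/${principal}/${principal}.keytab" \
        "${principal}/$(hostname -f)"
    exec "$@"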
[13:36:04] tarrow: it is worth thinking that there is no post data in hadoop (just urls) [13:36:17] tarrow: no post payload .. etc [13:37:04] We only have 1 request in fact: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.08.21/mediawiki?id=AWyz1kspZKA7RpiruGOj&_g=h@44136fa [13:38:41] tarrow: those ids are logstash identifiers right? [13:39:38] they are [13:40:49] I was hoping, perhaps in vain, that the reqId to unique_id also exist in hadoop [13:41:18] tarrow: it does not [13:42:02] ah [13:42:12] well we can check what requests to wikidata.org/wiki/Q666 happened in the timeframe happened right? [13:42:21] yes! [13:42:29] I was just typing out if I could ask that [13:53:34] 10Analytics, 10Reading Depth: Publish aggregated reading time dataset - https://phabricator.wikimedia.org/T230642 (10Nuria) @Groceryheist the data stream is been deactivated (this means that instrumentation is not sending data, but instrumentation is still present) >One thing I can use help with is the privac... [13:55:02] (03PS7) 10Mforns: [WIP] Add Oozie job for mediawiki history dumps [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530002 (https://phabricator.wikimedia.org/T208612) [13:56:22] (03CR) 10Mforns: [C: 04-2] "OK, readme added. But still, we need to agree on a final partitioning before we can merge." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530002 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [13:57:18] tarrow: I found ~20 requests but around 2019-08-21T11:02:02Z [13:57:50] (03CR) 10Nuria: Update webrequest load bundle dataset (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 (owner: 10Joal) [14:00:00] elukey: I guess there is no more we can do to isolate them then? [14:02:12] Thanks for looking for me; it's a shame none of those ids are shared between the datalake and logstash; I didn't realise I'd be making you go digging :) [14:03:08] tarrow: what do you mean? That the timeframe is not right or that there are too many? I can give you the list and then you can check more if you want [14:04:06] The timeframe initially sounded like it was a minute and one second off [14:04:20] but I'd be more than happy to look through the 20 myself :) [14:10:20] elukey: next version of turnilo is nice, those people do such a good work! [14:11:55] yesss [14:13:44] elukey: jus send them kudos [14:15:19] nuria: so ok to switch turnilo too? [14:15:41] elukey: ya, let me look at one more thing [14:16:16] elukey: i will send an e-mail so people know is updated , superset is mostly identical, turnilo is a lot more feature-full [14:20:18] elukey: +1 to turnilo [14:22:22] nuria: ok so when the traffic team gives me the green light I'll swap the nodes [14:22:27] and also deploy superset [14:22:49] elukey: ok, are they doing something with those nodes? [14:23:22] nuria: swapping it now, what do you mean? [14:23:39] elukey: sorry, i think i missunderstood! please go ahead! [14:25:34] nuria: ahh ok! So basically I created a new VM in ganeti with buster and the new turnilo (an-tool1007), so Varnish will need to point to it. 
Then I'll decom analytics-tool1002 (the current one with stretch) [14:25:52] elukey: sounds great [14:25:59] elukey: will send e-mail [14:26:53] elukey: one sec [14:27:12] elukey: the db on the superste i was testing was not up to date with prod though [14:27:20] !log swap turnilo backend in varnish from analytics-tool1002 to an-tool1007 [14:27:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:27:38] nuria: yeah it might not have but by some days, not more [14:27:48] I took the snapshot before starting [14:27:51] elukey: it was an older copy, so the final host cannot be an-tool1005 right? [14:28:24] nono an-tool1005 is staging, the prod host is analytics-tool1004 [14:28:58] elukey: ah ok, it is just turnilo getting swapped [14:29:09] yes yes correct [14:29:15] sorry too many an-tool references :) [14:33:20] joal, the mediawiki dumps job took about 3.5 hours when I was trying to repartition using outputFileParts=16 in Spark [14:33:56] now I'm using outputFileParts=1, because the archive_job_output subworkflow needs just 1 file [14:34:09] and it's taking longer... is that expected? [14:36:09] new version of turnilo deployed! [14:36:18] will decom the old host tomorrow if nothing comes up [14:37:40] (03CR) 10Joal: Update webrequest load bundle dataset (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 (owner: 10Joal) [14:38:06] nuria: hi :) [14:38:14] nuria: ehat are turnilo new features? [14:38:26] joal: heatmaps [14:38:31] 10Analytics, 10Patch-For-Review: Upgrade Turnilo to its latest upstream - https://phabricator.wikimedia.org/T230709 (10elukey) New version of turnilo deployed, will leave it running for a day before decommissioning analytics-tool1002 :) [14:38:46] https://usercontent.irccloud-cdn.com/file/0dvqhgqQ/Screen%20Shot%202019-08-21%20at%204.38.13%20PM.png [14:38:53] joal: see identified bots per country per hour [14:39:33] so cool ! [14:40:56] mforns: less output-files means less parallelization, and even if it's only done last step, it means the last computation step is made, for all a partition, on a single worker (instead of 16 in the previous case) [14:41:13] mforns: so you can say that yes, longer seems natural [14:42:04] :( yea, I suspected that [14:42:33] (03CR) 10Elukey: [V: 03+2 C: 03+2] Upgrade turnilo to 1.17.0 [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/530813 (https://phabricator.wikimedia.org/T230709) (owner: 10Elukey) [14:42:52] joal, can't we change the archive_job_output subworkflow to accept multiple files and concatenate them? [14:42:54] (03CR) 10Elukey: [V: 03+2 C: 03+2] Release version 0.34.0rc1-wikimedia [analytics/superset/deploy] - 10https://gerrit.wikimedia.org/r/529936 (owner: 10Elukey) [14:43:38] joal: for this codereview: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/531467/ [14:44:22] joal: why is it that the raw output needs to be defined at all? [14:44:34] a-team: deploying superset [14:58:11] nuria: the 'raw_output' is defined so that it could be used in jobs using raw webrequest other than load (which we haven't) [14:58:51] so there is a problem with superset [14:58:59] that didn't show up in staging [14:59:20] nuria: I think it's a leftover of when we had multiple bundles (load+refine) [14:59:26] oh elukey ? [14:59:41] joal: ah ok, so we can remove it entirely then [14:59:46] elukey: yesss... 
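The Q666 lookup run for tarrow earlier would look roughly like the following (host, path and time window are the ones from the log; wmf.webrequest column and partition names are the standard ones, but treat the exact selection as a sketch):

    hive -e "
      SELECT dt, uri_query, accept_language, x_analytics, pageview_info
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND year = 2019 AND month = 8 AND day = 21 AND hour = 11
        AND uri_host = 'www.wikidata.org'
        AND uri_path = '/wiki/Q666'
      LIMIT 50;
    "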
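On the outputFileParts question above: for gzip or bzip2 output the archive step could accept several part files and stream them into one, since concatenated gzip/bzip2 members still form a valid file. A sketch with hypothetical paths, not the current archive_job_output behaviour:

    SRC='/wmf/tmp/mediawiki_history_dumps/part-*'              # hypothetical job output
    DST='/wmf/data/archive/mediawiki_history/2019-07.tsv.gz'   # hypothetical archive target
    # the HDFS shell expands the glob itself; '-put -' reads the concatenated stream from stdin
    hdfs dfs -cat "$SRC" | hdfs dfs -put - "$DST"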
[15:00:22] nuria: we can remove it from the coordinator.xml and from the dataset, but if at some point we use the raw-webrequest, it'll need to be back [15:00:24] joal: if we can remove it then there is no room for error right? [15:00:36] joal: ok, let's add it then [15:01:08] nuria: I suggest we keep it in datasets_raw.xml (it's defined), either remove from output or keep it commented as it is [15:01:30] I commented it to try to keep some kind of knowledge of the --nocleanup trick [15:01:56] joal: i think code that is not used is better removed that commented out, so let's remove it from output? [15:02:04] By default hue check the box (no cleanup) when rerunning, so no issue has occured so far, but it could! [15:02:12] nuria: works for me :) [15:02:30] joal: thank you [15:02:30] I'm going to add a comment in datasets_raw.xml [15:02:40] joal: +1 [15:07:37] (03PS2) 10Joal: Update webrequest load bundle dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 [15:07:52] nuria: I actually kept the comment at the same place, as adding the hive partition is only done at thast place [15:09:02] (03CR) 10Nuria: [C: 03+2] Update webrequest load bundle dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 (owner: 10Joal) [15:09:18] elukey: let us know if you need help with superset [15:10:39] nuria: yeah I think something is off, plus I made a horrible mistake, and now I get your comment about the staleness of the superset status on an-too1005 [15:10:52] so I dumped the 'superset' database, not the superset_production [15:11:03] superset was the one that we had when there was no staging [15:11:11] and of course I did the same this time before upgrading [15:11:23] the last good backup is end of May for superset_production [15:11:31] that is not terrible but now I feel a complete idiot [15:20:09] triple time stupid since we have backups of the mysql database [15:24:59] elukey: I don't really know the details here, but I _am_ extremely confident that you're not stupid. 😊 [15:25:31] sending good vibes your way [15:25:41] thanks for the support neilpquinn :) [15:26:08] elukey: of course, thank you for constantly and silently taking care of our infrastructure :) [15:27:39] (03CR) 10Joal: [V: 03+2] Update webrequest load bundle dataset [analytics/refinery] - 10https://gerrit.wikimedia.org/r/531467 (owner: 10Joal) [15:29:13] neilpquinn: jajaja, we all agree on that [15:29:43] so I retried again to deploy 0.34 on analytics-tool1004 [15:29:57] the database upgrade goes fine, same as I did in staging [15:29:59] no issue [15:30:15] but when one tries to login, there are redirects happening [15:30:25] super weird thing, I didn't see them in staging [15:30:26] elukey: ohhh [15:30:41] elukey: even after you clear your cookies blah balh [15:30:58] elukey: ya, i did not see those testing either [15:31:05] nuria: yeah tried with incognito but same issue [15:31:13] elukey: let me look at logs [15:31:24] so it might be the new version of FlaskAppBuilder, that was bumped [15:31:31] (it contains also my fix for the first login) [15:31:43] elukey: right [15:31:43] logs are silent sadly [15:32:09] elukey: even apache ones? 
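Given the backup mix-up above (dumping the old superset database instead of superset_production), a pre-upgrade sketch; the MySQL host and credentials are assumptions to adapt:

    # take a consistent dump of the production metadata db before touching the schema
    mysqldump --single-transaction -h an-coord1001.eqiad.wmnet -u superset -p \
        superset_production > superset_production_$(date +%Y%m%d).sql
    # only then run the schema migrations shipped with the new release
    superset db upgrade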
[15:35:01] seems so yes [15:35:54] elukey: lf you give me permits to nuria@analytics-tool1004:/var/log$ cd apache2 i can do some digging [15:40:00] nuria: done in your home dir [15:40:00] 10Analytics, 10Product-Analytics, 10Reading Depth, 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10phuedx) A slightly prettier Grafana graph πŸ˜‰: https://grafana.wikimedia.org/d/... [15:40:08] 10Analytics: Turnilo : Issue with Is Deleted and Is Reverted dimensions when added as Split - https://phabricator.wikimedia.org/T230853 (10JAllemandou) Hi @Mayakp.wiki - Thanks for reporting! I can't say if it's related to the Turnilo new version Luca just deployed (thanks @elukey :) but it seems fixed for me.... [15:48:06] 10Analytics, 10Better Use Of Data, 10EventBus, 10Reading-Infrastructure-Team-Backlog: Create client side error schema - https://phabricator.wikimedia.org/T229442 (10Jhernandez) p:05Triageβ†’03High [15:53:06] something totally crazy [15:53:07] ssh -L 8088:analytics-tool1004.eqiad.wmnet:80 analytics-tool1004.eqiad.wmnet works [15:53:15] so even staging has the problem [15:53:35] elukey: meaning that you get redieected on 8088? on localhost? [15:54:20] nono with the tunnel I ean [15:54:22] *mean [15:54:27] the redirect doesn't happen [15:54:59] elukey: makes sense cause redirect is related to prior login, right? [15:54:59] what I do with it is to tunnel localhost:8088 to port 80 of analytics-tool1004.eqiad.wmnet [15:55:17] elukey: and on that domain there is no prior login [15:55:19] it makes little sense to me at the moment [15:55:27] elukey: jajajaja [15:55:35] elukey: ayayaya, right cause you cleared cookies [15:55:46] elukey: mmmm [16:06:41] 10Analytics, 10Editing-team: Deletion of limn-edit-data repository - https://phabricator.wikimedia.org/T228982 (10Neil_P._Quinn_WMF) @fdans I checked it over and we can archive/delete itβ€”there's nothing we need to keep. [16:17:16] 10Analytics, 10Editing-team: Deletion of limn-edit-data repository - https://phabricator.wikimedia.org/T228982 (10Nuria) Repo can be archived then, tagging #release-engineering-team to archive this repo [16:25:29] ok superset should be working now [16:34:29] elukey, superset seems to work fine for me, the only thing I get is a warning saying: "There was an issue fetching the favorite status of this dashboard." [16:37:49] mforns: thanks! [16:38:48] elukey: on brief inspection , all looks good [16:39:28] fiuuuu [16:39:31] nuria: thanks :) [16:45:31] need to step afk for ~15 mins [17:07:53] 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (10Nuria) As I mentioned on my prior post I do not think is such a great idea to release +5 and 100+ counts separately. Also , to cor... 
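The tunnel trick above can also be used headlessly to watch the redirect chain the new FlaskAppBuilder produces, for example (host and ports as in the log; the grep just keeps status lines and Location headers):

    ssh -N -f -L 8088:analytics-tool1004.eqiad.wmnet:80 analytics-tool1004.eqiad.wmnet
    curl -sIL 'http://localhost:8088/' | grep -iE '^(HTTP|Location)'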
[17:13:59] 10Analytics, 10Product-Analytics, 10Reading Depth, 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10Jdlrobson) [17:13:59] 10Analytics, 10Product-Analytics, 10Reading Depth, 10Patch-For-Review, 10Readers-Web-Backlog (Readers-Web-Kanbanana-2019-20-Q1): Reading_depth: remove eventlogging instrumentation - https://phabricator.wikimedia.org/T229042 (10ovasileva) [17:20:00] (03CR) 10Nuria: [C: 04-1] [WIP] Publish monthly geoeditor numbers (035 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/530878 (https://phabricator.wikimedia.org/T131280) (owner: 10Milimetric) [17:20:30] mforns: please tell me what you think of my comments on this patch: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/530878/ [17:21:24] 10Analytics, 10Analytics-Kanban: Roll restart all openjdk-8 jvms in Analytics - https://phabricator.wikimedia.org/T229003 (10Nuria) 05Openβ†’03Resolved [17:23:29] 10Analytics-Kanban, 10Patch-For-Review: Upgrade superset to 0.34 - https://phabricator.wikimedia.org/T230416 (10elukey) [17:24:28] 10Analytics-Kanban, 10Better Use Of Data, 10Product-Analytics: Superset Updates - https://phabricator.wikimedia.org/T211706 (10elukey) [17:27:53] * elukey off! [17:29:39] 10Analytics, 10Analytics-Kanban: Load Netflow to Druid - https://phabricator.wikimedia.org/T225314 (10Nuria) @ayounsi: this is the available data now, * I think* that maybe you added more dimensions after loaded was set up, if so be so kind to open another ticket with the additional dimensions, adding dimensio... [17:38:04] 10Analytics, 10Analytics-Kanban: Load Netflow to Druid - https://phabricator.wikimedia.org/T225314 (10Nuria) Ahem, pertinent link to the data: https://turnilo.wikimedia.org/#wmf_netflow [17:38:30] 10Analytics, 10Analytics-Kanban: Load Netflow to Druid - https://phabricator.wikimedia.org/T225314 (10Nuria) 05Openβ†’03Resolved [17:38:49] 10Analytics, 10Analytics-Kanban: Make timers that delete data use the new deletion script - https://phabricator.wikimedia.org/T226862 (10Nuria) 05Openβ†’03Resolved [17:39:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: API Request for unique devices for all wikipedia families is only showing data up to November 2018 - https://phabricator.wikimedia.org/T229254 (10Nuria) Closing ticket but cc-ing @JAllemandou per our conversation the other day. [17:40:10] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: API Request for unique devices for all wikipedia families is only showing data up to November 2018 - https://phabricator.wikimedia.org/T229254 (10Nuria) 05Openβ†’03Resolved [17:40:40] 10Analytics, 10Analytics-Kanban: Oozie queries that use 'reflect("org.json.simple.JSONObject"...' 
need refinery_hive jar - https://phabricator.wikimedia.org/T229669 (10Nuria) 05Openβ†’03Resolved [17:41:12] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Set up a generic workflow to create Kerberos accounts - https://phabricator.wikimedia.org/T226104 (10Nuria) 05Openβ†’03Resolved [17:41:14] 10Analytics: Enable Security (stronger authentication and data encryption) for the Analytics Hadoop cluster and its dependent services - https://phabricator.wikimedia.org/T211836 (10Nuria) [17:42:27] 10Analytics, 10Analytics-Kanban, 10Operations, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10Nuria) 05Openβ†’03Resolved [17:42:30] 10Analytics, 10Operations, 10Security, 10Services (watching), 10Wikimedia-Incident: Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap - https://phabricator.wikimedia.org/T226808 (10Nuria) [17:43:58] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move reportupdater queries from limn-* repositories to reportupdater-queries - https://phabricator.wikimedia.org/T222739 (10Nuria) Ping @fdans what does remain to be done in this task? [17:47:49] 10Analytics, 10Analytics-Kanban: MapChart zoom behavior broken - https://phabricator.wikimedia.org/T230062 (10Nuria) 05Openβ†’03Resolved [17:49:22] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Move refinery to hive 2 actions - https://phabricator.wikimedia.org/T227257 (10Nuria) @elukey: is the pull request the last job that needs to be moved? [17:49:28] 10Analytics: Turnilo : Issue with Is Deleted and Is Reverted dimensions when added as Split - https://phabricator.wikimedia.org/T230853 (10Mayakp.wiki) Hi @JAllemandou , I checked in Turnilo and the issue seems to be fixed now. Thanks for the update. [17:51:12] 10Analytics: Turnilo : Issue with Is Deleted and Is Reverted dimensions when added as Split - https://phabricator.wikimedia.org/T230853 (10Nuria) 05Openβ†’03Resolved [18:05:32] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Move refinery to hive 2 actions - https://phabricator.wikimedia.org/T227257 (10elukey) >>! In T227257#5429315, @Nuria wrote: > @elukey: is the pull request the last job that needs to be moved? yes exactly but not directly controlled by us (it runs under Neil's... [18:06:03] neilpquinn: o/ - can you check https://github.com/wikimedia-research/Audiences-External_automatic_translation/pull/1 when you have a min? :) [18:19:23] nuria: could you check https://phabricator.wikimedia.org/T220016 or forward me to someone to whom I can talk to? [18:20:18] raynor: hello, what are your questions? [18:21:26] we came up with new schema - MobileUITracking, we want to track user clicks on some elements. Previously we had MainMenu tracking schema. It was tracking only main menu clicks with 50% sampling [18:22:00] now, we want to track more elements (not only main menu, but also new menus (user menu,overflow menu), and additionally we want to track clicks on the toolbar [18:22:34] (watching/editing/etc), those actions happen much more often than someone opening the main menu and selecting option from there [18:22:53] we just don't want to melt servers, and because it's pretty difficult to come up with some numbers [18:23:16] it's the safest to roll-out with some small number, like 1% sampling rate, and then bump it if everything is fine [18:24:11] raynor: is this to be deployed in all wikis? 
[18:24:34] yes [18:24:37] raynor: 1% is much too much if so [18:24:58] raynor: i suggest 0.01% let it bake for a day and we shall see [18:24:59] it's only mobile wikis [18:25:10] no desktop [18:25:25] raynor: right, the bulk of our traffic is mobile [18:25:27] sorry, wrong wording.. it's both mobile and desktop wikis, but only Minerva skin [18:25:37] desktop minerva is very small, almost nothing [18:26:09] raynor: in the absence of any prior measures i suggest 0.01% and we can look at data , did you tested in beta labs? [18:26:31] the patch is not merged yet [18:27:16] raynor: then let's test in beta labs, please see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster [18:27:18] so the idea is - we had Schema MobileMainMenuTracking, we dropped that schema and introduced new one (MobileUITracking). The MobileUITracking is currently tracking the same stuff as the MainMenu [18:27:38] raynor: once that is working well we can sample at 0.01% and see data [18:27:41] with the same sampling rate, and it's ok, now we want to add more things (like tracking icons on the toolbar). [18:28:36] nuria: the schema works, it's 50% for MainMenu, we don't need to test it. What we want to do - is to track more actions [18:30:05] now we want to track things that are not hidden in the main menu, and most probably are more popular (like "watchstar icon" or "edit icon") [18:30:07] raynor: the schema might work, but in the absence of having tested in betalabs you do not know whether you have problems validating events, makes sense? also problems in instrumentation that might create tons of events when it should not be the case, both are problems thta might happen [18:30:09] *that [18:30:53] raynor: do you see a problem with testing all possible events in beta? [18:31:15] we will check it on beta cluster, that's for sure. I don't think it's necessary to test all possible events. [18:31:37] the code behind it is very simple, when you click on something it checks if HTML element has `data-event-name` attribute [18:32:00] if it has, it triggers analytics event, with name that is stored in `data-event-name`. That's all, no other use cases [18:32:33] raynor: right, my advise it check in beta, have a a sampling rate of 0.01% look at data for a day and then decide (with the help of a data analyst) how much data you need [18:33:10] ok, sounds good to me, could you post that in the phab ticket please? [18:34:06] also, 0.01% sounds really small number, those actions do not happen per each page view. Guts says that this will generate less events than NavigationTiming [18:34:34] raynor: You might be right, let's look at data and see. [18:34:49] if could go 0.1% that would be much better, bit more data. now we're tracking main menu with 50% [18:35:07] also, most of the events (user menu, overflow menu) are feature flagged, those are visible only to AMC users [18:36:05] raynor: lower sample rates do not cause problems, it is easy to increase them if needed. [18:36:58] raynor: also, this schema: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=MobileWebMainMenuClickTracking [18:37:11] raynor: does not seem active, did i got the name wrong? [18:37:37] raynor: ah i see, was disabled 7/26 [18:37:40] that makes sense. I just want to save our time on bumping the sampling rate step by step. AMC is not that popular yet so the user base is pretty small. 
I'm just afraid that 0.01% won't give us enough data [18:37:58] yes, so MobileWebMainMenuClickTracking is decommisioned, we don't use it any more [18:38:09] now we use MobileWebUIActionsTracking [18:38:34] we wanted to track much more things, using `MainMenu` schema just sounded bad [18:38:52] raynor: this one? https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=now-30d&to=now&var-schema=MobileWebUIActionsTracking [18:38:56] it's mostly renaming the schema/updating properties so the new schema fits us better [18:39:05] yes, this one [18:39:13] and it's sampled 50% for the main menu [18:39:22] raynor: can you explain those peaks of events? [18:39:42] raynor: almost 5x the normal flow? [18:40:05] raynor: they look a bit strange [18:40:16] now we will add more things to track - two new menus (that are available only to AMC users, very small subset) and icons on toolbar -> this is available to everyone, also to anon users. And honestly the toolbar actions are the only one that made us worried about sampling rate [18:41:11] raynor: how many more events are you sending than before? [18:41:27] no, I have no idea why we have those peaks, code is pretty simple, => if there is click event on HTML element with data-event-name attribute, send the event. That's the all logic behind it [18:42:24] how many more? previously we were tracking clicks on ~8 elements. now it will be ~30 elements [18:43:03] but some some elements (toolbar icons) are much more popular than other (menu). From our research some time ago looks like Main Menu is not that popular [18:44:12] Ok. Nuria, I think you're right. Let's go with 0.01%, and then we will bump it, I just need to confirm with Mneisler that she is ok with such path [18:44:24] raynor: a back of the napkin calculation makes those 800 events peaks (which might indicate some issue, not sure) maybe 30,000 events which is sizable so 0.01% makes sense [18:44:54] and then we will be bumping the rate when everything is fine. [18:44:58] raynor: i would also look into the nature of the data sent at those peaks it might indicate an instrumentation issue. [18:45:28] hey mforns - Did the thing work as exepected? [18:45:33] could you just leave a note here (https://phabricator.wikimedia.org/T220016), that would help me explain why we started with 0.01% [18:47:18] nuria, for MainMenu - https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=MobileWebMainMenuClickTracking&from=1559414775396&to=1562006775397 we had similar peaks [18:48:08] not that often though [18:48:40] raynor: just did with graph and question about peaks [18:48:52] ok, awesome, thank you! [18:49:33] raynor: those peaks indicate an issue with instrumenting code, i would advice to look a bit into it. probably you can get a lot of clues from the data sent at that time [18:51:33] we will check it, thanks [18:54:13] joal, hey! YES [18:54:21] \o/ !! [18:54:29] Awesome :) [18:54:43] Things work as they should, can we say :D [18:56:11] I'm executing the full job righ tno [18:56:14] right now [19:43:02] 10Analytics, 10Analytics-Kanban: Load Netflow to Druid - https://phabricator.wikimedia.org/T225314 (10ayounsi) Thanks! Luca opened (and solved) T229682 for the extra dimensions. Special casing is not important enough for now, thanks!
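A rough, self-contained version of the back-of-the-napkin sizing discussed above; the daily click volume is a made-up placeholder, only the arithmetic (volume times sampling rate) is the point:

    clicks_per_day=300000000             # hypothetical number of tracked UI clicks per day
    for rate in 0.0001 0.001 0.01; do    # 0.01%, 0.1% and 1% sampling
      printf 'sampling %s -> ~%.0f events/day\n' "$rate" \
        "$(echo "${clicks_per_day} * ${rate}" | bc -l)"
    done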