[01:19:03] Analytics: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria)
[01:50:49] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria)
[01:52:15] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria) This smells like bot activity....
[02:16:12] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (Neil_P._Quinn_WMF) Resolved→Open >>! In T206894#5022937, @Nuria wrote: > @mforns resta...
[04:26:43] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics
[05:55:53] good morning camus!
[06:22:16] I am a bit confused about the error message
[06:35:41] I don't see any clear failure in either camus eventlogging or refine eventlogging
[06:36:01] and I do see the _REFINED flag if I check some of the alarming topics
[07:16:16] also I can see that in the camus logs there is a "CamusPartitionChecker$:281 - Flagging imported partitions for etc..."
[07:16:25] after the error
[07:18:39] I am surely missing something
[07:37:37] one thing that could be useful to add to CamusPartitionChecker's error message is the values that lead to it
[07:39:41] so to recap:
[07:40:44] 1) camus-eventlogging (better - its checker job) reports some failures in importing data from some topics, like NavigationTiming and VirtualPageview (this is on purpose, those are whitelisted by the checker)
[07:42:23] inspecting the raw data via file system doesn't reveal any clear "there is no imported data in here" situation
[07:42:39] 2) monitor eventlogging refine reports also missing refined data
[07:43:43] I've manually re-run it, and another email came
[07:43:45] The following dataset targets in /wmf/data/raw/eventlogging between 2019-04-16T03:13:39.778Z and 2019-04-17T03:13:39.778Z have not yet been refined to /wmf/data/event
[07:44:09] among all
[07:44:09] `event`.`NavigationTiming` (year=2019,month=4,day=16,hour=19)
[07:44:12] `event`.`NavigationTiming` (year=2019,month=4,day=16,hour=20)
[07:44:15] `event`.`NavigationTiming` (year=2019,month=4,day=16,hour=21)
[07:44:18] `event`.`NavigationTiming` (year=2019,month=4,day=16,hour=23)
[07:44:21] `event`.`NavigationTiming` (year=2019,month=4,day=17,hour=0)
[07:44:24] but then
[07:45:02] ls /mnt/hdfs/wmf/data/event/NavigationTiming/year=2019/month=4/day=16/hour=*/_REFINED
[07:45:11] shows all REFINED flags
[08:07:33] the only interesting thing that I can see in the refined data is that the hours highlighted by the alarm have more snappy files
[08:07:41] usually there are ~3 for each hour
[08:07:48] on those hours there are 9/12
[08:07:55] (smaller files)
[08:18:09] timing overlaps partially with my roll restart of kafka jumbo
[08:57:30] * elukey is playing with pyspark2
[08:57:41] the sql context is really nice
[09:02:19] wow
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-24h&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=All
[09:06:21] !log restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart
[09:06:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:10:24] lag seems to start decreasing
[09:14:25] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (akosiaris) >>! In T220661#5116643, @Ottomata wrote: > Been doing a lot to get more data, i...
[09:38:38] now I am not really understanding the numbers of refined data
[09:55:39] in any case, even if the lag is going down, I think that we are still lagging ~3M messages
[09:56:04] so let's see how it goes up to the scheduled maintenance window for cdh 5.16.1
[10:06:22] elukey: I don't understand the issue :(
[10:10:26] joal: bonjour!
[10:10:33] Bonjour!
[10:10:43] How are you today elukey ?
[10:11:07] good! I have used pyspark2 and I definitely need to learn how to use it, it is magical
[10:11:10] :D
[10:11:18] (properly I mean)
[10:11:34] elukey: plenty computers working on making a lot of data look small is awesome :D
[10:11:52] yep!
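
A minimal sketch of what the "~3M messages" lag figure above measures, assuming the usual definition: for each partition, the log-end offset minus the consumer group's committed offset, summed over partitions. The partition numbers and offsets are made up for illustration, and `partition_lag` / `total_lag` are hypothetical helpers, not part of any Kafka client. Treating a missing committed offset as 0 is a simplification.

```python
def partition_lag(end_offsets, committed_offsets):
    """Return {partition: lag} where lag = log-end offset - committed offset.

    Simplification (assumption): a partition with no committed offset
    is treated as if the consumer were at offset 0.
    """
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def total_lag(end_offsets, committed_offsets):
    """Sum the per-partition lag into a single number for the consumer group."""
    return sum(partition_lag(end_offsets, committed_offsets).values())


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 5_000_000, 1: 4_800_000, 2: 5_100_000}
committed = {0: 4_000_000, 1: 3_900_000, 2: 4_000_000}

print(partition_lag(end_offsets, committed))
print(total_lag(end_offsets, committed))
```

Lag going to zero, as reported later in the log, would correspond to every committed offset catching up with its partition's log-end offset.
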
[10:12:08] about the issue: I have probably written non-sense up to now
[10:13:19] hm - non-sense I wouldn't dare say so - Maybe in-course-of-understanding thought process :)
[10:13:42] I didn't see any clear camus failure in importing data, and refined flags were set but monitor eventlogging refine has been alerting up to now
[10:13:47] then I found the consumer lag
[10:13:56] and restarted eventlogging
[10:13:59] right
[10:14:36] makes sense, we've experienced this once, and refine-monitor has been put in place for that exact purpose: camus doesn't fail if no data is imported, nor refine
[10:14:59] but I can see refined data for the hours that monitor refine complains
[10:15:09] this was my major question mark
[10:15:09] :(
[10:15:17] indeed this is not cool
[10:15:20] also refined flags etc..
[10:15:45] refined-flag is normal - refine actually succeeded - But probably refined too few rows
[10:15:49] maybe this is not a clean 'no-data' failure, but partial?
[10:15:55] exactly
[10:16:09] Let's check that
[10:16:13] elukey: batcave?
[10:16:21] sure
[10:39:40] * elukey lunch!
[11:43:06] (CR) Santhosh: "@chelsyx, why this patch not merged after +2 from Nuria?"
[analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[11:48:24] Analytics, EventBus, MediaWiki-Maintenance-scripts: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224 (Lucas_Werkmeister_WMDE)
[11:50:57] Analytics, EventBus, MediaWiki-Maintenance-scripts, Wikimedia-General-or-Unknown: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224 (Reedy)
[11:51:18] Analytics, EventBus, MediaWiki-Maintenance-scripts, WMF-JobQueue, Wikimedia-General-or-Unknown: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224 (Reedy)
[11:53:24] Analytics, EventBus, MediaWiki-Maintenance-scripts, WMF-JobQueue, Wikimedia-General-or-Unknown: showJobs.php maintenance script useless and misleading in production - https://phabricator.wikimedia.org/T221224 (Lucas_Werkmeister_WMDE) > (Similarly, the [job queue health Grafana board](https://...
[12:34:55] elukey: when you get a sec can you look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/504538 as a concept. i hope to do a similar thing for all *-canary roles defined in /etc/cumin/aliases.yaml
[12:49:02] jbond42: sure!
[12:49:14] I am wondering if we could just use aqs1004 in cumin aliases
[12:51:07] elukey: that is how it is currently however i would like to expand the use of the canary role. it is currently used by debdeploy to test software updates however i would also like to use the canary role for testing puppet changes
[12:51:40] e.g. i would upgrade puppet/facter on the canary roles and test there first
[12:54:13] jbond42: fine for me, I didn't think that including a role into a role was contemplated in our current puppet guidelines, but if so +!
[12:54:16] +1
[12:54:18] nothing against it
[12:54:29] joal: o/
[12:54:45] any news from your investigation?
[12:54:52] elukey: i believe the multiple role things is related to the role() function
[12:54:56] and thanks
[12:55:07] hi elukey
[12:55:14] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (JAllemandou) @Neil_P._Quinn_WMF : Indeed the validation job failed for expected reasons (a hig...
[12:55:36] I now have different wonders (supposedly more precise, but not sure)
[12:56:09] elukey: The topic in problem is event_logging_client_side - I can't recall what it does
[12:56:38] joal: it is the one that varnishkafka-eventlogging pushes events to
[12:56:47] elukey: Is that the one that gathers all event-logging-client-side events, before they get validated and sent to by-topic?
[12:56:52] right
[12:56:58] exactly
[12:57:04] elukey: I had forgotten we camus this data
[12:57:39] joal: we only do it for backup purposes, but with another instance of camus
[12:57:47] I found that
[12:58:13] o/
[12:58:27] morningggg ottomata
[12:58:52] The other thing I found is that camus actually gets a lot of data, but doesn't `progress` in time, meaning it doesn't move the offset to the last read
[12:59:31] I am wondering if it is a matter of letting the el's processors to get to lag 0
[12:59:41] elukey: I think it'll work
[13:00:15] joal: do you think that we should postpone the cdh upgrade?
[13:00:59] maybe to tomorrow
[13:01:05] just to be safe (if you have time of course)
[13:02:00] elukey: the refine-monitor error was telling us that a lot of data was missing - The last email (sent at 9:15AM today, just AFTER the EL restart) shows a different time as the first non-refined folder (hour 19, not 18), meaning that refine probably succeeded with the lag having been caught up
[13:02:42] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (phuedx) From the SAL: > 15:21 otto@deploy1001: scap-helm eventgate-analytics finished > 15:21 otto@deploy1001: scap-helm eventgate-analytics cluster staging...
[13:02:42] elukey: I think we'll be good to proceed in ~1h - Would that still be ok?
[13:03:05] elukey: from a lag perspective I mean (1h)
[13:05:39] joal: +1
[13:05:54] if we aren't good we postpone to another day
[13:05:59] yessir :)
[13:10:04] elukey: I feel it should be good - last alarm or refine-monitor not happy was at 9:15, and the last camus error at hour 9 as well - I feel confident the EL restart did solve the issue
[13:10:23] (PS4) Ottomata: Add ExternalGuidance event logging table to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[13:10:26] (CR) Ottomata: [V: +2] Add ExternalGuidance event logging table to whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[13:10:49] (CR) Ottomata: [V: +2] "Looks like it wasn't submitted for some reason, probably due to needing a rebase." [analytics/refinery] - https://gerrit.wikimedia.org/r/499270 (https://phabricator.wikimedia.org/T218838) (owner: Chelsyx)
[13:11:44] joal: just a nit - the refine monitor runs once a day (OnCalendar=*-*-* 04:15:00), the 9:15 occurrence was me restarting manually :D
[13:11:56] Ah !
[13:11:59] but you are right
[13:12:13] I think that we are out of the mud
[13:12:44] elukey: let's wait for the lag to be fully recovered, for a camus run to have succeeded after, and then maybe we can retest manually before moving into upgrade?
[13:13:55] elukey: I confirm that camus has been trying to reimport data (folders are bigger for the said hours, and checker has been validating same hours multiple times)
[13:14:46] +1
[13:16:09] (PS4) Joal: Update mediawiki-history user bot fields [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504025 (https://phabricator.wikimedia.org/T219177)
[13:17:47] (CR) Joal: "Sorry for that - done" (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504025 (https://phabricator.wikimedia.org/T219177) (owner: Joal)
[13:22:04] (PS2) Joal: Fix mediawiki-user-history writing filter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504382
[13:23:19] (CR) Joal: Fix mediawiki-user-history writing filter (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504382 (owner: Joal)
[13:41:37] joal: lag is zero now
[13:41:43] GO GO GO !
[13:41:45] :)
[13:41:58] shall we restart refine monitor?
[13:42:01] elukey: actually, when does the next camus run happen? at the hour?
[13:42:09] checking
[13:42:28] Wed 2019-04-17 14:05:00 UTC 22min left Wed 2019-04-17 13:05:00 UTC 37min ago camus-eventlogging.timer camus-eventlogging.service
[13:43:15] I can force a run of camus-eventlogging.service
[13:44:02] well let's see if refine monitor is still upset
[13:44:06] worst that happens is an email :)
[13:46:14] 19/04/17 13:45:23 INFO RefineMonitor: No dataset targets in /wmf/data/raw/eventlogging between 2019-04-16T09:44:13.268Z and 2019-04-17T09:44:13.268Z need refinement to /wmf/data/event
[13:46:21] \o/
[13:46:31] YAY :)
[13:46:36] nice :)
[13:47:01] Now that everything is back on track, let's disable everything and make it late again ;)
[13:47:19] exactly
[13:47:20] :P
[13:49:56] I have downtimed all the hosts for two hours
[13:49:57] with
[13:49:58] cookbook sre.hosts.downtime --minutes 120 'A:hadoop or A:druid or A:stat-hdfs or A:notebook or A:analytics-tools' -r "elukey - upgrading cdh to 5.16.1"
[13:50:04] this is magic
[13:50:24] :)
[13:50:30] also puppet disabled
[13:51:06] RECOVERY - Check the last execution of monitor_refine_eventlogging_analytics on an-coord1001 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_analytics
[13:51:17] elukey: this is dark magic for me - TOO POWERFUL to be put in the wrong hands
[13:53:22] and the following disables all the timers
[13:53:22] cumin -m async 'A:hadoop-coordinator' 'systemctl stop camus-*.timer' 'systemctl stop hdfs-balancer.timer' 'systemctl stop refine*.timer' 'systemctl stop eventlogging_to_druid*.timer' 'systemctl stop monitor*.timer' 'systemctl stop drop*.timer' 'systemctl stop check_*.timer' 'systemctl stop sanitize_*.timer' 'systemctl list-timers'
[13:58:02] ok let's now wait for yarn to drain
[14:03:59] joal: all pyspark shells, shall we kill them?
[14:04:38] elukey: I think Baho will hate us for killing recommender, but yes, we shall kill
[14:05:38] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) After talking with Alex, we will using: ` requests: cpu: 200m memory:...
[14:06:17] :)
[14:06:25] should I proceed or do you want to do it?
[14:07:05] Analytics, Analytics-Kanban, Research, Article-Recommendation, Patch-For-Review: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (JAllemandou) Hi @bmansurov, I've been monitoring the current run of the recommender (https://yarn.wiki...
[14:07:05] elukey: I can do it if it saves you time :)
[14:07:23] nono doing it
[14:07:49] Done :)
[14:08:23] ahahah okok
[14:08:25] thanks :)
[14:08:56] proceeding then
[14:09:10] joal: do you prefer to bat cave or just following in here?
[14:12:03] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[14:12:27] PROBLEM - Oozie Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie
[14:12:36] this one was not downtimed, my bad
[14:12:37] fixing
[14:13:35] entering safe mode and saving namespace
[14:15:42] stopping all daemons
[14:18:11] elukey: I'm following from here, but we can batcave if you prefer
[14:18:31] hey guys
[14:18:43] joal: good in here
[14:18:45] I'm getting this error when try to connect to the notebook1003: channel 2: open failed: connect failed: Connection refused
[14:19:05] Analytics, Analytics-Kanban, Research, Article-Recommendation, Patch-For-Review: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (bmansurov) @JAllemandou, that makes sense. I'm currently measuring the times spent running each operat...
[14:19:11] dsaez: o/ I guess you missed my email about the hadoop cluster going in maintenance :D
[14:19:22] oh
[14:19:37] elukey, i see
[14:19:50] dsaez: should be up in ~30 mins!
[14:19:53] mandatory break!
[14:19:54] (hopefully)
[14:20:20] In fact, I don't have it
[14:20:32] coming from you?
[14:20:56] dsaez: I've sent it to analytics@ and engineering@, probably I should have included research, I thought that you were already in those :(
[14:20:57] anyhow
[14:21:01] will fix next time, sorry!
[14:21:12] yep coming from me
[14:21:12] sure!
np
[14:21:23] dsaez: https://lists.wikimedia.org/pipermail/engineering/2019-April/000695.html :)
[14:21:24] I would go for a beer
[14:21:30] I mean, coffee
[14:21:35] ahahahaah
[14:23:30] Analytics, Analytics-Kanban, Research, Article-Recommendation, Patch-For-Review: Generate article recommendations in Hadoop for use in production - https://phabricator.wikimedia.org/T210844 (JAllemandou) Actually it'll not finish - We just killed it as we need to restart the cluster (planned...
[14:23:47] * dsaez is joining the analytics mailing list
[14:24:07] dsaez: engineering might also be good :)
[14:24:47] elukey: hdfs command is broken for me - looks good ;)
[14:25:17] joal: cluster completely down, doing backups now
[14:25:24] awesome
[14:25:30] joal, will do, but amount of emails received is inversely proportional with the attention spent :)
[14:25:54] joal, I was up to trying to combine mwparserfromhell with the spark dumps
[14:26:15] let's see how complex this can become :D
[14:26:19] absolutely true dsaez - I sort them in lists and screen most of them for keywords that my brain normally match :)
[14:27:00] dsaez: That would be of great interest :)
[14:27:44] maybe makes no sense, but I want to have a try. I'm playing with templates, and writing my own regex for that is very tricky.
[14:28:11] indeed
[14:28:37] dsaez: the thing I'm not sure of is if the lib is installed on the cluster
[14:29:14] joal, yes, the first challenge is to load within the spark context. There is a method called addPyFile that is supposed to work on those cases
[14:29:58] dsaez: IIRC the parser is that fast because it relies on C libs that need to be installed on the machine - Maybe I'm wrong, but that's what I recall
[14:31:00] IIRC ?
[14:31:04] oh
[14:31:10] If I remember correctly
[14:31:24] Yessir :)
[14:31:32] let me see the code
[14:33:14] not sure, but it's very likely
[14:33:35] we need a sparkmwparserfromhell :D
[14:35:36] currently upgrading the workers
[14:47:49] elukey: hdfs is back ;)
[14:49:29] I am currently upgrading the clients
[14:49:34] and finally the coordinator
[14:52:05] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Ottomata) Ok prod updated with new limits! I ran two abs from two different machines with...
[14:52:14] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[14:54:19] Analytics, Analytics-Kanban, EventBus, Patch-For-Review, Services (watching): EventGate service runner worker occasionally killed, usually during higher load - https://phabricator.wikimedia.org/T220661 (Pchelolo) Agreed, let's do it!
[14:58:17] joal: all done, checking time :)
[14:58:22] ack !
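
On the template thread earlier (dsaez, 14:27): one reason a hand-written regex gets tricky for wikitext templates is nesting, since `{{a|{{b}}}}` needs balanced matching. The following is a minimal pure-Python sketch that extracts the outermost `{{...}}` spans by tracking brace depth. It is an illustration only, not how mwparserfromhell works; `top_level_templates` is a hypothetical name, and the scanner ignores `{{{param}}}` triple-brace syntax, `<nowiki>` sections and comments.

```python
def top_level_templates(wikitext):
    """Return the outermost {{...}} spans found in wikitext.

    Nested templates stay inside the span of their enclosing template.
    Unbalanced braces are silently skipped (assumption: good enough for a sketch).
    """
    spans = []
    depth = 0
    start = None
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i  # remember where the outermost template opens
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(wikitext[start:i + 2])
            i += 2
        else:
            i += 1
    return spans


text = "Intro {{Infobox|name={{lang|fr|Paris}}}} and {{cite web|url=x}} end"
print(top_level_templates(text))
```

Whether something like this could be shipped to Spark executors with `addPyFile` is the open question from the chat; a pure-Python file has no C dependency, which is the concern joal raises about the real parser.
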
[14:58:40] running home bbiab
[14:59:00] RECOVERY - Oozie Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.catalina.startup.Bootstrap https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Oozie
[15:04:21] (CR) Milimetric: [C: +2] Update mediawiki-history user bot fields [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504025 (https://phabricator.wikimedia.org/T219177) (owner: Joal)
[15:10:02] (Merged) jenkins-bot: Update mediawiki-history user bot fields [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504025 (https://phabricator.wikimedia.org/T219177) (owner: Joal)
[15:10:11] elukey: all good for me (spark2-shell, pyspark2, hive, beeline)
[15:10:13] (CR) Milimetric: [C: +2] Fix mediawiki-user-history writing filter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504382 (owner: Joal)
[15:10:38] joal: \o/
[15:10:47] ~40 mins in total
[15:11:06] (the upgrade procedure I mean)
[15:11:15] joal: if you are ok I'd re-enable timers
[15:12:12] elukey: testing a mapreduce, just to be sure
[15:12:39] all good
[15:12:57] super
[15:12:57] elukey: hue error for me :(
[15:13:08] elukey: not nice :(
[15:13:36] let me have a conversation with hue
[15:13:48] I wouldn't dare interfere
[15:14:15] joal: try again
[15:14:25] goood :)
[15:15:25] elukey: all oozie coords seem happy
[15:15:27] let's go :)
[15:18:23] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria) Eventgate and eventlogging (despite name) share no infrastructure so those two events are unrelated. I wonder whether the rolling restart made the metr...
[15:18:42] (Merged) jenkins-bot: Fix mediawiki-user-history writing filter [analytics/refinery/source] - https://gerrit.wikimedia.org/r/504382 (owner: Joal)
[15:19:16] seems all done
[15:19:21] going to check druid to be sure
[15:19:33] * joal claps elukey for a flawless upgrade :)
[15:19:38] \o/
[15:19:44] let's wait a couple of hours though
[15:19:47] NICE!
[15:20:05] elukey: I agree druid can be a late failer
[15:20:19] the procedure takes less than an hour now, and if we automate all of it we could go even less
[15:20:40] (I imagine something like: "step x done, going to do y, is it ok? y/n"
[15:21:27] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Ottomata) @elukey, Nuria is right here, yes? If eventlogging-processors were stuck for a bit, then the per schema topics will jump up when raw client side ev...
[15:21:30] (PS9) Mforns: Add edit_hourly oozie job [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092)
[15:21:54] (CR) Mforns: "This is ready for review after OK from product analytics." [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[15:22:39] (CR) Mforns: "This is ready for review after approval from product analytics." [analytics/refinery] - https://gerrit.wikimedia.org/r/501328 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[15:22:52] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (elukey) >>! In T221181#5119095, @Ottomata wrote: > @elukey, Nuria is right here, yes? If eventlogging-processors were stuck for a bit, then the per schema to...
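
A minimal sketch of the "step x done, going to do y, is it ok? y/n" idea floated at 15:20: a runner that asks for confirmation before each step and stops at the first answer other than "y". The step names and the `run_steps` helper are hypothetical; the real steps would be the cumin and cookbook invocations quoted earlier in the log, which are not modelled here. The `confirm` callable is injectable so the demo runs with scripted answers instead of reading from a terminal.

```python
def run_steps(steps, confirm=input):
    """Run (name, action) pairs in order, asking before each one.

    `confirm` receives the prompt text and returns the operator's answer.
    Any answer other than 'y' aborts before the step runs.
    Returns the list of step names that completed.
    """
    done = []
    for name, action in steps:
        prompt = f"[{len(done)}/{len(steps)} done] going to do: {name}. is it ok? y/n "
        if confirm(prompt).strip().lower() != "y":
            break
        action()
        done.append(name)
    return done


# Placeholder actions standing in for the real upgrade commands.
demo_steps = [
    ("stop timers", lambda: print("timers stopped")),
    ("drain yarn", lambda: print("yarn drained")),
    ("upgrade workers", lambda: print("workers upgraded")),
]

# Scripted answers: approve two steps, refuse the third.
scripted = iter(["y", "y", "n"])
print(run_steps(demo_steps, confirm=lambda prompt: next(scripted)))
```

Calling `run_steps(demo_steps)` without the `confirm` argument would prompt interactively via `input`.
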
[15:40:32] (PS5) Mforns: Add oozie job to load edit_hourly to druid [analytics/refinery] - https://gerrit.wikimedia.org/r/501607 (https://phabricator.wikimedia.org/T211173)
[15:41:38] (CR) Mforns: "After OK from product analytics, this is ready for CR." [analytics/refinery] - https://gerrit.wikimedia.org/r/501607 (https://phabricator.wikimedia.org/T211173) (owner: Mforns)
[15:58:08] Analytics, Anti-Harassment, Product-Analytics: Add partial blocks to mediawiki history tables - https://phabricator.wikimedia.org/T211950 (JAllemandou) Hi @nettrom_WMF - I have a test dataset for you that include this data (example in scala-spark2: ` val user_history = spark.read.parquet("/user/joal/...
[16:02:08] ping ottomata standduppp
[16:06:29] Analytics, Analytics-Kanban: Provide edit tags in the Data Lake edit data - https://phabricator.wikimedia.org/T161149 (JAllemandou) Hi @Neil_P._Quinn_WMF - Test data is available :) Here is an example of accessing it in scala-spark2: ` val history = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/...
[16:07:40] Analytics, Research: Check home leftovers of ISI researchers - https://phabricator.wikimedia.org/T215775 (elukey) @leila any news?
:)
[16:24:11] Analytics, Analytics-Wikistats: Reindex mediawiki_history_reduced with lookups - https://phabricator.wikimedia.org/T193650 (Milimetric) p:Triage→Normal
[16:29:40] (CR) Nuria: [C: +2] Add oozie job to load edit_hourly to druid [analytics/refinery] - https://gerrit.wikimedia.org/r/501607 (https://phabricator.wikimedia.org/T211173) (owner: Mforns)
[16:29:45] fdans i just heard that joke on the way out
[16:29:46] LOVE IT
[16:29:47] hahha
[16:30:20] (CR) Nuria: [C: +2] Add edit_hourly to list of tables to be purged of old snapshots [analytics/refinery] - https://gerrit.wikimedia.org/r/501328 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[16:32:40] (CR) Nuria: "Triple checking that this job pulls data once the mediawiki checker has run and succeeded" [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[16:39:55] (CR) Joal: "> Patch Set 9:" [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[16:44:50] ottomata: :D:D hell yea
[16:46:02] (CR) Joal: [V: +2] Add oozie job to load edit_hourly to druid [analytics/refinery] - https://gerrit.wikimedia.org/r/501607 (https://phabricator.wikimedia.org/T211173) (owner: Mforns)
[16:46:18] nuria: I assume you were meaning this --^ ?
[16:46:50] (CR) Fdans: [C: +2] "Nice one. I had incorporated this fix in the timeselector one, but it's better to have it as a separate change." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/504417 (owner: Milimetric)
[16:47:10] (CR) Fdans: [V: +2 C: +2] Fix problem with breakdown [analytics/wikistats2] - https://gerrit.wikimedia.org/r/504417 (owner: Milimetric)
[16:47:56] mforns: I assume the edit-hourly job (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/501197/8) is to be merged as well?
[16:49:54] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (phuedx) There's a spike in `VirtualPageview` events today (see https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=1555459200000&to=155...
[16:53:16] (PS1) Fdans: Release 2.5.7 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/504612
[16:53:37] (CR) Fdans: [V: +2 C: +2] Release 2.5.7 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/504612 (owner: Fdans)
[16:53:37] joal, yes, there are actually 3 patches to merge: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/501607/
[16:53:48] https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/501328/
[16:53:55] and https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/501197/8
[16:56:27] Analytics, ExternalGuidance, Product-Analytics, MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review: Measure the impact of externally-originated contributions - https://phabricator.wikimedia.org/T212414 (chelsyx)
[16:56:29] Analytics, Product-Analytics, Patch-For-Review: Add ExternalGuidance event logging table to whitelist - https://phabricator.wikimedia.org/T218838 (chelsyx) Open→Resolved Thanks everyone!
[17:03:16] * elukey off!
[17:03:39] ebernhardson: o/
[17:03:53] whenever you have time let's re-create the venv
[17:04:37] or better - I think that we can just rename the current venv in your home to something else, and then just via the jupyter ui you should trigger a recreation of the new one
[17:04:41] IIUC
[17:04:58] elukey: oh right, i totally spaced on that while working on a talk i have to give next week.
Go ahead and rename, i already have a bunch of venv's
[17:05:24] or actually i have to give the talk tomorrow as a dry-run on hangouts, and then give it to a conference hall next week :)
[17:06:24] elukey: recreated
[17:07:25] hmm, lemme see if jupyter recreates it on its own
[17:08:06] logging in didn't create a venv, and gave the same 500 ISE as before
[17:08:37] says something about logs for ebernhardson may contain details, but not sure where those logs end up
[17:09:02] host:notebook1003 in logstash doesn't turn anything up
[17:09:47] should be jupyter-ebernhardson-singleuser.service
[17:10:55] so journalctl -u jupyter-ebernhardson-singleuser.service but not sure if you can read it..
[17:11:09] ebernhardson: did you create the venv the first time?
[17:11:22] (I am a bit ignorant about notebooks)
[17:11:26] elukey: tried both ways, both got the 500 ISE
[17:11:33] journalctl doesn't like me: No journal files were opened due to insufficient permissions.
[17:11:56] ah no wait
[17:12:28] I just restarted the main jupyterhub service
[17:12:32] can you retry?
[17:12:47] elukey: now it started :)
[17:12:48] I am following https://wikitech.wikimedia.org/wiki/SWAP#Resetting_user_virtualenvs
[17:12:51] ah!
[17:12:58] good :)
[17:13:32] mforns: I actually have a bunch of questions on the definition of the dataset :(
[17:13:51] joal, sure, you can -1 and we'll deploy next train
[17:14:21] mforns: There actually is 1 naming thing that is a -1 indeed - Will do
[17:14:33] I wouldn't have -1 otherwise, for questions only
[17:14:35] ok, no problemo
[17:15:24] (CR) Joal: [C: -1] "Questions about definition and names. -1 because of the dt vs ts convention break." (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[17:15:36] thx joal!
[17:15:58] mforns: I only reviewed the table definition here, when we have that bunch solved, I'll check the rest (oozie related technicalities :)
[17:16:07] ok
[17:16:09] Sorry for the late intervention mforns :(
[17:16:16] no no, not late at all
[17:16:21] dsaez, neilpquinn, ottomata, joal - if you have time can you review your home on notebook1003 and see if anything can be dropped?
[17:16:36] bearloga as well please :)
[17:17:23] elukey: looks good for me AFAICS
[17:17:33] Gone for dinner, will come back after
[17:17:41] joal: good meaning nothing to drop? :)
[17:17:45] done elukey
[17:17:50] thanks!
[17:22:01] (going off for real!)
[17:23:37] elukey: hit me up tomorrow there appears to be some disk usage weirdness going on in my homedir
[17:24:08] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Ottomata) VirtualPageview counts in Hive should be normal, it's only Kafka messages that should have delayed messages causing a later jump. If you see the sam...
[17:36:20] (CR) Mforns: Add edit_hourly oozie job (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[18:17:14] (CR) Joal: [C: -1] Add edit_hourly oozie job (3 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[18:23:16] o/ nuria wanna finally talk array items type?
[18:32:04] Analytics, Analytics-Kanban, EventBus, Operations, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata)
[18:45:30] Analytics, EventBus, MediaWiki-JobQueue, Core Platform Team Kanban (Doing), Services (doing): Create scripts to estimate Kafka queue size per wiki - https://phabricator.wikimedia.org/T182259 (mobrovac) p:Triage→Normal @Pchelolo let's add it to `ops/puppet`?
[19:32:17] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria) Ok, will close ticket cause it seems pretty well established that this is related to consumers consuming at a higher rate after a restart.
[19:32:25] Analytics, Product-Analytics, Readers-Web-Backlog: strange virtual pageview jump on 2019-04-16-03 - https://phabricator.wikimedia.org/T221181 (Nuria) Open→Resolved
[19:43:49] Analytics, Analytics-EventLogging, MW-1.34-release, Technical-Debt (Deprecation): Remove deprecated EventLogging schema modules - https://phabricator.wikimedia.org/T221281 (Krinkle)
[19:44:02] Analytics, Analytics-EventLogging, MW-1.34-release, Technical-Debt (Deprecation): Remove deprecated EventLogging schema modules - https://phabricator.wikimedia.org/T221281 (Krinkle)
[19:44:30] Analytics, Analytics-Kanban, EventBus, Operations, and 5 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (Ottomata) ganeti1001, 1002 and 2001 have been installed. I dunno what's up with ganeti2002. `gnt-instance console schema2002.cod...
[19:53:19] hmm, elukey did you mean to enable profile::kerberos::client for all nodes in the analytics labs project?
[20:03:25] Analytics: PHP serialization can contain null bytes - https://phabricator.wikimedia.org/T221283 (Milimetric)
[20:59:46] (CR) Nuria: Add edit_hourly oozie job (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/501197 (https://phabricator.wikimedia.org/T220092) (owner: Mforns)
[21:53:52] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Set up automated email to report completion of mediawiki_history snapshot and Druid loading - https://phabricator.wikimedia.org/T206894 (Neil_P._Quinn_WMF) Thanks, @JAllemandou! Sounds like a good plan.
[21:59:46] Analytics, Analytics-Kanban, Better Use Of Data, Product-Analytics, Patch-For-Review: "Edit" equivalent of pageviews daily available to use in Turnilo and Superset - https://phabricator.wikimedia.org/T211173 (Neil_P._Quinn_WMF) Thanks, @mforns! Looks good to me as well. >>! In T211173#5116048...
[22:34:04] Analytics, Analytics-Kanban, Better Use Of Data, Product-Analytics, Patch-For-Review: "Edit" equivalent of pageviews daily available to use in Turnilo and Superset - https://phabricator.wikimedia.org/T211173 (MNeisler) >> And will create another task to add the user_tenure_field and also the...
[22:37:51] elukey: thanks for the ping - I cleaned up a bunch of my stuff :)
[23:15:27] Analytics, Analytics-Kanban, Better Use Of Data, Product-Analytics, Patch-For-Review: "Edit" equivalent of pageviews daily available to use in Turnilo and Superset - https://phabricator.wikimedia.org/T211173 (kzimmerman) Thank you all! My understanding is that wider launch (sharing with stak...
[23:59:38] hi so I want to get a pyenv set up on a stat machine