[05:57:50] (CR) Elukey: [V: 2 C: 2] Upgrade to upstream 1.8.1 version [analytics/turnilo/deploy] - https://gerrit.wikimedia.org/r/469198 (https://phabricator.wikimedia.org/T197276) (owner: Elukey)
[06:07:55] morning!
[06:08:07] so for the druid upgrade:
[06:08:31] 1) I downloaded 0.11 debs to each host from apt so we have a quick rollback in case something goes wrong
[06:08:40] 2) uploaded to apt 0.12.3-1
[06:08:47] 3) merged the turnilo changes
[06:09:13] I'd say that at around 10 CEST we start stopping indexation jobs just to be sure
[06:09:18] and then we proceed :)
[06:33:39] (PS1) Elukey: Add za.wikimedia to the pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/469557
[07:08:13] Analytics-Tech-community-metrics, Code-Health, Release-Engineering-Team (Kanban): Develop canonical/single record of origin, machine readable list of all repos deployed to WMF sites. - https://phabricator.wikimedia.org/T190891 (Quiddity)
[07:25:40] I'm with you elukey :)
[07:31:16] joal: morning! So I might have found a way to integrate https://issues.apache.org/jira/browse/HIVE-12582 in our puppet repo, didn't think about it before
[07:31:27] it should allow us to add the prometheus jmx exporter to hive server
[07:31:33] but I need to test it
[07:31:47] \o/ !!
[07:31:58] Thanks for keeping searching elukey :)
[07:32:22] elukey: question
[07:32:49] https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/469499 has been merged - I'm assuming that in order to be deployed, we need to update the CDH submodule in main puppet?
[07:38:01] joal: I think that andrew deployed it, lemme check
[07:38:21] elukey: I'm sure he deployed one of the two (the previous one), but I wonder about that one
[07:39:34] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469515/
[07:40:00] hm
[07:40:17] you are not convinced :D
[07:40:28] This is bizarre then - The supposedly created folder still doesn't exist on stat1004 for instance :(
[07:42:42] Ah ! You've been faster for za.wikimedia
[07:42:45] elukey: --^
[07:42:59] elukey: many thanks for that - It's my ops week, I should have done it yesterday
[07:43:00] did it earlier on :)
[07:43:06] nah
[07:43:35] elukey: Have you manually patched the list?
[07:44:10] nope still haven't, was waiting for the review
[07:44:26] mmmm it is strange that the tmp file is not deployed
[07:44:28] right - will review and update the list (except if you have a strong will to do it :)
[07:44:47] I have not :D
[07:45:40] (CR) Joal: [V: 2 C: 2] "Merging !" [analytics/refinery] - https://gerrit.wikimedia.org/r/469557 (owner: Elukey)
[07:46:01] elukey@cumin2001:~$ sudo cumin 'R:file = "/tmp/hive-parquet-logs"' 'ls -l' --dry-run
[07:46:04] 57 hosts will be targeted:
[07:46:06] an-coord1001.eqiad.wmnet,analytics[1028-1077].eqiad.wmnet,analytics-tool1001.eqiad.wmnet,notebook[1003-1004].eqiad.wmnet,stat[1004-1005,1007].eqiad.wmnet
[07:46:09] DRY-RUN mode enabled, aborting
[07:46:28] hm - That means the folder exists?
[07:46:34] it is in the puppet catalog
[07:47:09] ahem ... /me doesn't know what the puppet catalog does :(
[07:47:09] but I can see it only on stat1005.eqiad.wmnet
[07:47:30] right - I manually created that one as a test before the patch
[07:47:55] ah ok
[07:48:05] so the puppet catalog is basically a list of resources for each host
[07:48:18] if you define a File in a manifest for example it gets added
[07:48:32] ok
[07:49:56] and those resources are automagically created on the host once puppet runs on the system?
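(For reference, this is roughly what the File resource under discussion looks like when tried stand-alone with puppet apply - an illustrative sketch, not the actual cdh module code; the mode is an assumption:)

    # 'ensure => directory' is the line that tells puppet to create a directory
    # rather than a plain file; this is the detail joal later realizes he had missed.
    sudo puppet apply -e "file { '/tmp/hive-parquet-logs': ensure => 'directory', mode => '0777' }"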
[07:50:29] yes they get created according to the config
[07:50:38] /tmp/hive-parquet-logs is a dir, right?
[07:50:41] yes
[07:51:20] I wonder if my patch was correct in that regard elukey
[07:51:43] I don't remember having done anything to make puppet know it's a dir
[07:51:50] let's try with https://gerrit.wikimedia.org/r/469563
[07:52:37] !log Manually add za.wikimedia to pageview-whitelist (patch merged: https://gerrit.wikimedia.org/r/469557)
[07:52:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:53:59] I am updating the submodules in puppet
[07:56:17] joal: Notice: /Stage[main]/Cdh::Hive/File[/tmp/hive-parquet-logs]/ensure: created
[07:56:20] stat1004 :)
[07:56:39] Many thanks elukey :)
[07:57:07] elukey@stat1004:~$ ls -l /tmp/hive-parquet-logs
[07:57:07] total 0
[07:57:14] ok mystery solved :)
[07:57:22] elukey: I knew it was bizarre that my puppet was correct on the first try :)
[07:57:53] elukey: now testing the reason for that folder to exist
[07:58:04] ??
[07:58:38] \o/ it works :)
[07:58:38] joal: this is my proposal for hive https://gerrit.wikimedia.org/r/469562
[07:58:55] elukey: if you ls now in the folder, you'll see :)
[07:59:44] I'm going to let chelsyx and bearloga know - I think they'll be happy
[07:59:53] ah ok you were saying "testing the new thing", I thought that you were looking into why we had to add it :D
[07:59:56] nevermind
[07:59:59] :)
[08:04:38] joal: let's stop indexations in say 10/15 mins?
[08:04:51] elukey: checking current state
[08:06:17] elukey: we are currently refining webrequest hour 7, and indexing stuff for hour 6
[08:06:51] I suggest we wait for the current indexations to be done, and suspend the jobs (to prevent hour 7 from starting once refine is done)
[08:07:27] elukey: only 2 jobs to suspend: webrequest and pageview - the other ones are daily
[08:09:32] +1
[08:10:58] !log Suspend webrequest-druid-hourly and pageview-druid-hourly oozie jobs
[08:10:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:11:02] elukey: --^
[08:11:21] super thanks!
[08:15:13] I added the patch to the hive server sh file
[08:15:26] so now I can test again in labs the prometheus stuff
[08:16:10] awesome elukey :)
[08:41:47] it works!!!
[08:41:50] \o/
[08:42:02] ! \o/
[08:42:12] hiveServer2 metrics, here you come :)
[08:42:23] sadly mbeans only for jvm
[08:51:25] joal: ready for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469585/
[08:51:39] do you think that we could squeeze in a hive restart? :P
[08:52:14] elukey: I'm a squeezing master - no problem :)
[08:54:13] all right merging changes, an-coord1001 with puppet disabled
[08:54:23] I am seeing hive jobs in yarn uff
[08:54:37] elukey: I don't
[08:54:47] elukey: I think now is actually good
[08:54:55] goooood they are done!
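(The prometheus jmx exporter mentioned above is normally wired into a JVM daemon via a -javaagent flag; a minimal sketch of what the hive-server2 startup script change could look like - the jar path, port and config file location here are assumptions, not the actual patch:)

    # Hypothetical paths and port; the agent exposes the JVM mbeans as prometheus
    # metrics on the given address once the daemon is restarted with these opts.
    export HADOOP_OPTS="$HADOOP_OPTS -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=0.0.0.0:9183:/etc/prometheus/hive_jmx_exporter.yaml"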
doing it
[08:56:33] !log restart hive-server on an-coord1001 to pick up new prometheus settings
[08:56:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:57:11] aaand hive -> show databases works fine
[08:57:13] \o/
[08:57:38] elukey: we should test with beeline as well (to use hiveserver2)
[08:57:55] works :)
[08:58:00] * elukey dances
[08:58:29] the last daemon not covered seems to be oozie
[08:58:45] elukey: indeed, the last daemon
[09:01:34] joal: https://grafana.wikimedia.org/dashboard/db/analytics-hive
[09:01:36] yessss
[09:02:00] * joal bows to elukey - Master of metrics
[09:02:09] \o/
[09:02:47] so now we can see if Neil's metric causes troubles
[09:02:54] yes indeed
[09:07:24] elukey: I also found config params that could be interesting in that regard
[09:08:10] anything interesting?
[09:08:33] possibly - need to check
[09:14:03] joal: so we could deploy turnilo now and see if there is anything weird with druid 0.11
[09:14:21] ok elukey
[09:14:27] all right
[09:14:27] following you
[09:15:03] elukey: after more and more reading, it seems that the memory limitation is actually due to a general hadoop setting - Need to triple check though
[09:16:04] Turnilo (version 1.8.1) is open source under the Apache 2.0 license.
[09:16:07] done :)
[09:16:19] !log upgrade turnilo to 1.8.1
[09:16:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:16:52] elukey: I don't know if it is related to the turnilo update or druid being better, but man it's snappy !!!
[09:17:19] also elukey:
[09:17:19] https://turnilo.wikimedia.org/#unique_devices_per_domain_monthly/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ACgqwATJXlUg7MGuiy4CVgIwBNSy0daQRMTEM7SSh5TXwfAF8AXSSmKE0kNEdnDW1uCyY7CDZsKE9TcyL9EABzd2wYBFoIDW5fAFko8Po8UENjAjN8lwhDcgwcbjgocmJsQuxqkHimJBZm/HqEBBSQNim3YkdQaDaGjHwpRChiVIgFhGCYbAgARxhDk0PWdCqWM4gVCAnq93lBPtEij9JMC3h
[09:17:24] 9MFIpNd6EwYaCTE87AovpD5CBkkxNHcSHYACKVEqeLKJAlE4h2ADKXW4qI+2JYUJWxGqs0ieE2CCYlAg1UoSBF3X58SAA===
[09:17:27] oops sorry
[09:17:51] elukey: https://gist.github.com/jobar/0725225f092bfe17038cf41bbaa75ef1
[09:17:54] \o/ !!!!
[09:18:17] the x axis is fixed!
[09:18:36] elukey: I think this is a strong yes for this new version :D
[09:18:50] nice, kudos to them :)
[09:19:03] it would be nice to give them some feedback
[09:19:23] now IIRC marcel was joining for the druid upgrade at 12?
[09:19:52] elukey: IIRC it was the opposite - Marcel wanted to join for turnilo :)
[09:20:10] But yes timing was 12
[09:20:26] nono he wanted to see the druid upgrade
[09:20:33] Ah ok - my bad :)
[09:34:13] so my rsync for home dirs stat1005 -> stat1007 is still running from yesterday :P
[09:41:31] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey) Hi @mpopov! We are in the process of migrating everybody from stat1005 to stat1007 (not announced yet for users) but I am wondering if I could move the `statistics:...
[09:43:14] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[09:43:42] brb in 15 mins :)
[09:56:32] elukey: I THINK I FOUND IT !
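(The beeline smoke test mentioned above exercises HiveServer2 itself, unlike the hive CLI which talks to the metastore directly; a sketch, assuming HiveServer2 runs on an-coord1001 on its default port 10000:)

    # Connection string assumed; a trivial query is enough to prove the server answers.
    beeline -u 'jdbc:hive2://an-coord1001.eqiad.wmnet:10000' -e 'SHOW DATABASES;'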
[09:56:51] elukey: no bump in hiveserver2 memory usage when running the query
[09:57:09] elukey: therefore it must be a parameter - Well it is well hidden :)
[09:58:08] elukey: https://hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
[09:58:15] elukey: look for mapred.child.java.opts
[10:01:13] * elukey reads https://stackoverflow.com/questions/24070557/what-is-the-relation-between-mapreduce-map-memory-mb-and-mapred-map-child-jav
[10:01:34] very interesting!
[10:02:18] given the heap consumption values and this discovery I'd propose to lower the hive server/metastore Xmx
[10:02:41] elukey: maybe not, if we bump mapred.child.java.opts?
[10:03:20] sure, one thing at a time, but I think that those heaps are huge
[10:03:27] elukey: my view is that this parameter is not used by mapreduce per se but by the local task of hiveserver2
[10:03:36] I hear that elukey
[10:04:02] the thing that I am worried about is that it is more an overhead than a benefit
[10:04:10] but, let's first tweak mapred.child.java.opts
[10:04:15] then we see how the heap changes
[10:04:34] have you issued a big query now?
[10:04:41] I can see some interesting GC metrics
[10:04:44] for the server
[10:05:01] elukey: nope, only small queries on my side
[10:06:20] an interesting thing could be to apply CMS for the old gen
[10:07:26] yeah even cloudera recommends it
[10:10:09] anyway, given it is a bit late for our schedule I'd say to start the druid upgrade
[10:10:17] then we'll fill marcel in when he joins
[10:10:41] hey team
[10:10:45] o/
[10:10:47] gooooodd
[10:10:50] I was about to start :)
[10:10:50] elukey: you know how to call for Marcel :)
[10:11:03] sorry for being 10 minutes late!
[10:11:15] so following http://druid.io/docs/latest/operations/rolling-updates.html I'd start with the historicals
[10:11:19] one at a time
[10:11:32] ok, are you guys in da cave, or just here?
[10:11:50] just here
[10:11:52] k
[10:12:45] ok so first thing I am installing druid-commons, that is the deb containing all the shared libs
[10:13:01] that will be picked up by each daemon after the restart
[10:13:25] and now I've just upgraded the historical on druid1001
[10:13:33] https://grafana.wikimedia.org/dashboard/db/druid
[10:14:02] ok, I'm following that
[10:14:22] so in the logs the daemon is loading the segments
[10:14:31] like
[10:14:32] 2018-10-25T10:14:26,018 INFO io.druid.server.coordination.SegmentLoadDropHandler: Loading segment[3688/3942][mediawiki_history_beta_2018-04-01T00:00:00.000Z_2018-05-01T00:00:00.000Z_2018-10-08T16:33:51.393Z_5]
[10:15:44] seems ready elukey (from druid-coord UI)
[10:15:46] memory-mapping?
[10:15:58] super
[10:16:45] I don't recall if it uses memory mapping explicitly or only leverages the linux page cache, I'd say the latter
[10:18:07] historical on 1002 upgraded
[10:19:32] loaded all segments
[10:19:59] ok
[10:20:46] coordinator's metrics look good, joal ok to restart the last historical?
[10:20:56] +1~
[10:20:57] !
[10:20:59] sorry :)
[10:22:33] ok 1003 upgraded as well, segments loaded
[10:23:12] now it is the turn of the overlords
[10:23:21] cache size dropped as expected :)
[10:24:34] mysql conns look good
[10:24:40] where do you see that joal, I can not see that in any grafana dashboard...
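(mapred.child.java.opts, the parameter joal points at above, sets the JVM options - notably the heap - that hadoop passes to spawned task JVMs, and per his theory also to HiveServer2's local tasks; it can be overridden per session to test the hypothesis. A sketch, with the -Xmx value purely illustrative and the JDBC URL assumed as before:)

    # The failing join query would follow the SET in the same beeline session.
    beeline -u 'jdbc:hive2://an-coord1001.eqiad.wmnet:10000' -e 'SET mapred.child.java.opts=-Xmx1024m;'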
[10:24:56] mforns: in grafana, historical section
[10:25:13] turnilo serves data as well
[10:25:17] So far so good :)
[10:25:46] can see it now, was looking at druid_public (default)
[10:26:12] overlords done, proceeding with the middlemanagers
[10:27:39] done
[10:28:04] now the brokers, all good so far?
[10:28:06] elukey: testing those will need indexation jobs :)
[10:28:21] yep yep
[10:28:37] in half an hour the EL2Druid indexation will trigger
[10:29:06] broker on 1001 is up
[10:29:22] elukey: I forgot about EL2Druid !
[10:29:34] elukey: lucky we are, we could have broken those
[10:29:45] I am sorry mforns :(
[10:30:05] I suspended the oozie indexation jobs but didn't think of EL2Druid
[10:30:11] broker on 1002 coming up
[10:30:13] no prob, it was done for sure at 12:10h and won't start until 13:00h
[10:30:55] (PS8) Fdans: Add change_tag to mediawiki_history sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/465416
[10:30:55] aaaand all brokers up
[10:31:00] last ones, the coordinators
[10:31:03] and then we are done
[10:31:17] :]
[10:31:39] (CR) Fdans: Add change_tag to mediawiki_history sqoop (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/465416 (owner: Fdans)
[10:32:06] (CR) Fdans: [V: 2] Upgrade packages and commit package-lock to remove vulnerabilities [analytics/aqs] - https://gerrit.wikimedia.org/r/467733 (https://phabricator.wikimedia.org/T206474) (owner: Fdans)
[10:32:59] so an interesting trick that moritz taught me is to use lsof -Xd DEL
[10:33:02] when upgrading
[10:33:29] this will tell you the files that were deleted but are still referenced by a file descriptor, so still in use by a daemon
[10:33:32] elukey --verbose
[10:33:34] Ah
[10:33:44] nice trick !
[10:33:53] for example, when I upgraded druid-common a ton of those popped up
[10:33:57] and now they are gone
[10:34:11] aha
[10:34:43] everything upgraded!
[10:35:23] !log upgraded Druid on druid100[1-3] to 0.12.3-1
[10:35:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:35:45] elukey: launching indexation back?
[10:35:49] +2
[10:36:34] !log Resuming oozie webrequest and pageview druid hourly indexation jobs
[10:36:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:36:46] Hallo.
[10:37:31] overlord master is 1001 (if anybody wants to check the ui)
[10:37:40] https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Druid
[10:37:42] mforns: --^
[10:37:53] If I use mw.track() to log something to EventLogging, and there is a parameter that doesn't appear in the schema, what will happen?
[10:38:15] elukey, yes, already on it
[10:38:28] Will it be silently ignored? Or will the logging fail? Or something else?
[10:38:55] oh, no, I'm on the coordinator
[10:39:13] also overlord
[10:39:19] elukey: webrequest indexing task started
[10:40:45] aharoni, if the field is optional, nothing will happen, the field will receive the value NULL in the DB
[10:41:14] if the field is required, then validation will fail in the eventlogging_processor, and the event will not make it to the database
[10:41:30] mforns: I think it's the opposite - the value seems not to be present in the schema
[10:41:33] oh!
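(The lsof trick quoted above, spelled out: -d DEL selects entries whose file descriptor is of type DEL, i.e. deleted-but-still-mapped files, so right after a package upgrade it shows which daemons are still running the old code until restarted; the grep filter is just an example:)

    # Any druid process still holding old jars shows up here until it is bounced.
    sudo lsof -Xd DEL | grep druid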
[10:41:36] understood
[10:41:47] ha good question
[10:42:13] aharoni, if I recall correctly, it also fails validation
[10:43:05] we had such a case a couple days ago, where some fields were renamed before sending the event, and thus were recognized by the eventlogging_processor as not pertaining to the schema, raising errors
[10:46:28] mforns, joal - thanks
[10:46:42] np :]
[10:49:56] joal: indexations succeeded right!!??
[10:49:59] seems all good
[10:50:47] :]
[10:51:00] ah wow query/cache/caffeine/delta/evictionBytes
[10:51:15] this is something that we could add to the druid exporter
[10:51:34] Nice
[10:51:42] elukey: indeed, everything looks good :)
[10:53:40] elukey: back to my hive issue
[10:54:01] elukey: I think I have explanations, and they are really weird!
[10:54:52] so druid upgraded! \o/
[10:56:40] joal: I am all ears, lemme know :)
[10:57:35] elukey: actually I need more investigation again - But the issue seems related to running hadoop in local mode, more than to memory issues
[11:03:13] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (elukey) p: Triage>Normal
[11:03:42] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (elukey) Turnilo 1.8.1 has been deployed, as far as I can see the issue seems fixed!
[11:07:33] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (elukey) druid100[1-3] upgraded today, we'll proceed with druid public on monday if no issue will be registered!
[11:07:47] joal: ok to upgrade druid public on monday?
[11:07:51] if nothing comes up
[11:07:56] yessir !
[11:09:48] EL2Druid indexations also successful
[11:10:08] niceeee
[11:23:08] Man - This hadoop debugging is hell
[11:34:07] :(
[11:35:58] elukey: I have found an interesting param: hive.mapred.local.mem
[11:36:03] however, it seems unrelated
[11:39:14] we are talking about T206279, right?
[11:39:15] T206279: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279
[11:39:22] correct elukey
[11:40:01] in the task though I don't see an OOM error.. I was trying to read the exact error that we are trying to solve
[11:40:13] elukey: indeed
[11:40:29] elukey: the problem comes from a command that fails
[11:40:51] elukey: HiveServer2 runs a hadoop command when starting a job
[11:41:04] And the one it tries to launch in that precise case fails
[11:41:53] but we don't have more info about the why
[11:42:30] (I was about to add a comment for the hive-server2 heap size metrics)
[11:42:48] anyway, need to step afk, going to test my shoulder at the gym for a bit :)
[11:42:54] let's see if it has recovered :)
[11:42:56] nope elukey - I'm fighting to try to get that why more precisely
[11:43:09] I'll try to help when I am back!
[13:04:22] joal: I'm back online, what's this about changes to hive that Chelsy and I will be happy about? :D
[13:26:33] Analytics, Analytics-Data-Quality, Contributors-Analysis, Product-Analytics, Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (Ottomata) Great thanks! Fine with Edit2, as long as there are good docs explaining as you say....
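(Events failing validation in eventlogging_processor are emitted as EventError events rather than silently dropped; a sketch of how one might watch for them, reusing the kafkacat pattern that appears later in this log - the topic name is an assumption based on the usual eventlogging_<Schema> convention:)

    # -o end tails only newly produced error events instead of replaying history.
    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_EventError -o end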
[13:45:56] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Decide whether to use schema references in the schema registry - https://phabricator.wikimedia.org/T206824 (Ottomata) BTW, we should probably be using `$id` for schemas using their versio...
[13:46:50] Analytics-EventLogging, Analytics-Kanban, EventBus, Services (watching): Modern Event Platform: Schema Registry Implementation - https://phabricator.wikimedia.org/T207869 (Ottomata) OOPs past me had already done this and forgot. DOH merging in.
[13:48:18] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[13:48:28] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[13:48:34] Analytics-EventLogging, Analytics-Kanban, EventBus, Services (watching): Modern Event Platform: Schema Registry Implementation - https://phabricator.wikimedia.org/T207869 (Ottomata)
[13:48:40] bearloga: o/ - try to use hive, no more parquet spam :)
[13:50:29] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata)
[14:05:01] renamed the dashboard to ikimedia/ottomata) to #wikimedia-analytics │
[14:05:09] argh
[14:05:18] how the hell did I copy/paste that :D
[14:05:19] anyhow
[14:05:35] renamed the dashboard to https://grafana.wikimedia.org/dashboard/db/hive
[14:08:36] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Ottomata) @Pchelolo https://snowplowanalytics.com/blog/2014/05/15/introducing-self-desc...
[14:10:24] Heya bearloga - Normally no more painful parquet-logs in the middle of hive results :)
[14:11:32] ncue
[14:11:33] nice
[14:11:46] ooooooooh
[14:12:01] ottomata: I had forgotten the 'directory' line for folder creation in puppet :(
[14:12:08] joal: 🙌
[14:12:14] elukey: we should probably put the hadoop_cluster label on the hive jvms too
[14:12:18] joal me too!
[14:13:45] Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (elukey) Neil, today we added (finally! \o/) JVM metrics to the Hive server, you can see them in https://grafana.wikimedia.org/dashbo...
[14:14:18] elukey, ottomata - The error about --^ is actually trickier than just memory
[14:15:20] (CR) Milimetric: Memoizing results of state functions (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: Nuria)
[14:15:26] Analytics, Analytics-Kanban, User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (elukey) Druid upgraded on druid100[1-3], so we can finally start looking into this :)
[14:16:53] I have spent the day troubleshooting, and it seems that the hive MapredLocalTask (called from the hive ExecDriver) tries to execute the wrong jar in hadoop when run in local-task mode (see
[14:17:06] lovely
[14:17:38] In the hive server log: Executing: /usr/lib/hadoop/bin/hadoop jar /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan .......
[14:18:07] And when trying to execute this command manually, well, the org.apache.hadoop.hive.ql.exec.mr.ExecDriver class is not available in /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar
[14:18:21] The correct jar to be used would be /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.10.0.jar
[14:18:54] There are conf utilities to get the jar to run, and in the end it all comes from the jar set up for the original job
[14:19:57] I am kinda lost
[14:20:36] elukey: no wonder :)
[14:20:47] wuut weird
[14:21:26] elukey: the weirdest part is that hive actually runs VERY differently for a regular yarn job vs a local one
[14:21:44] regular yarn jobs are handled through the hadoop job conf, settings and all
[14:22:07] local jobs are launched through a CLI !
[14:22:16] what are local jobs?
[14:22:27] elukey: local-tasks I should say
[14:22:56] elukey: They are a specific type of task (map or red) happening locally, to avoid the yarn overhead
[14:23:01] They are useful for small data
[14:24:36] !log added AAAA DNS records to all the druid nodes
[14:24:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:26:36] joal: I am wondering - is it fine if I upgrade druid public now? The new version looks very stable and performant
[14:27:01] +1 elukey - I need to leave to catch the kids, but I feel safe with you doing it :)
[14:27:13] famous last words :D
[14:27:18] all right
[14:27:23] ottomata: you ok with me upgrading druid public?
[14:27:53] elukey: if you may - Gently on the historicals (for segments and cache) - please
[14:28:00] Apart from that, all good
[14:28:09] elukey: cool yeah!
[14:28:10] I am always gentle with historicals :D
[14:28:17] :)
[14:28:20] i'm here if you need me
[14:28:25] ack!
[14:28:50] !log upgrade druid on druid100[4-6] to Druid 0.12.3
[14:28:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:50:55] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson) a: Cmjohnson>RobH @robh added label on server, added to switch asw-a-eqiad ge-6/0/18 up up weblog1001 and in private1-a
[14:58:48] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[14:59:07] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[14:59:25] druid public upgraded!
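(A quick way to confirm the wrong-jar diagnosis above - checking which of the two jars actually ships the ExecDriver class; a sketch using the exact paths quoted in the conversation:)

    # Only hive-exec should match; hive-common does not contain the MR exec classes.
    for jar in /usr/lib/hive/lib/hive-common-1.1.0-cdh5.10.0.jar \
               /usr/lib/hive/lib/hive-exec-1.1.0-cdh5.10.0.jar; do
      unzip -l "$jar" | grep -q 'org/apache/hadoop/hive/ql/exec/mr/ExecDriver.class' \
        && echo "ExecDriver found in $jar"
    done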
[14:59:50] Analytics, Operations, ops-eqiad: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (Cmjohnson)
[15:05:23] elukey: NICE with turnilooo
[15:06:42] \o/
[15:06:52] Analytics, Operations, ops-eqiad: Degraded RAID on analytics1029 - https://phabricator.wikimedia.org/T207644 (Cmjohnson) Open>Resolved a: Cmjohnson I had one remaining 4TB spare disk on-site. Replaced the disk, cleared the cache and all disks are back
[15:09:58] (PS2) Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968)
[15:10:23] nuria milimetric I think this one pleases everyone :)
[15:10:43] fdans: we are SO high maintenance
[15:12:29] WAIT DONT REVIEW I GOTTA REPLACE THE TABS WITH SPACES
[15:12:33] goddammit
[15:12:35] elukey: and monthly granularity mostly works cc joal wow
[15:13:46] (PS3) Fdans: Handle null name values in top metrics from UI [analytics/wikistats2] - https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968)
[15:15:19] elukey: https://bit.ly/2q7dBcq
[15:19:26] Analytics, Analytics-Kanban: Reboot Analytics hosts for kernel security upgrades - https://phabricator.wikimedia.org/T203165 (Cmjohnson)
[15:19:30] Analytics, Operations, ops-eqiad: analytics1068 doesn't boot - https://phabricator.wikimedia.org/T203244 (Cmjohnson) Open>Resolved @elukey okay
[15:21:36] Analytics, Operations, ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (Cmjohnson) @elukey dbstore1002 is out of warranty and has 1.2T disks. I don't have disks this size but can replace with a 2TB disk..
[15:27:50] milimetric: after having taken a look at the transpiling of html templates (which happens for all browsers, regardless of whether they support templates or not)... mmm ahem... i think string concatenation is much simpler
[15:29:49] nuria: sorry I was in a meeting - all good right?
[15:30:13] elukey: ya, super good actually, there is data we have not been able to plot in turnilo for 1.5 years that we can see now
[15:31:22] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (elukey) Druid public upgraded too!
[15:33:01] \o/
[15:36:20] !log shutdown aqs1006 to replace one broken disk
[15:36:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:41:46] Can one of you experienced sqoopers please check my command? I'm too scared to run it without having ever sqoop'd before.
https://www.irccloud.com/pastebin/klkfNkZI/wb_terms_sqoop
[16:02:41] ping milimetric
[16:10:43] bearloga: plenty of sqoop docs here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Dumping_data_via_sqoop_from_eventlogging_to_hdfs
[16:10:55] nuria: yup, that's what I'm using as reference
[16:12:51] nuria: joal is helping me right now :) I'll def ping milimetric if I need more help
[16:13:05] bearloga: milimetric is out today
[16:20:46] Analytics, Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (elukey)
[16:22:12] Analytics, Operations, ops-eqiad: Degraded RAID on aqs1006 - https://phabricator.wikimedia.org/T206915 (Cmjohnson) I sent HP a diagnostic log showing disk 5 as failed {F26794607} {F26794615}
[16:25:18] Analytics, Analytics-Kanban, Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (Nuria) Open>Resolved
[16:25:31] Analytics, Analytics-Kanban, User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (Nuria) Open>Resolved
[16:26:06] Analytics, Analytics-Kanban, Page-Issue-Warnings, Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (Nuria) Open>Resolved
[16:26:24] Analytics, Analytics-Kanban, Patch-For-Review, Readers-Web-Backlog (Tracking): Ingest data into druid for readingDepth schema - https://phabricator.wikimedia.org/T205562 (Nuria) Open>Resolved
[16:28:44] Analytics, Analytics-Cluster, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (Ottomata) Reopening because we have an idea: We will set `SET hive.auto.convert.join=false;` in hive-server2 only, (not for hive CL...
[16:29:09] Analytics, Analytics-Cluster, Analytics-Kanban, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (Ottomata)
[16:33:00] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (mpopov) >>! In T205846#4694101, @elukey wrote: > Hi @mpopov! We are in the process of migrating everybody from stat1005 to stat1007 (not announced yet for users) but I am w...
[16:36:20] bearloga: o/
[16:36:22] about --^
[16:36:35] if you have 5 mins and it is quick we can do it now
[16:44:29] elukey: sure
[16:44:37] \o/
[16:44:49] so as far as I can see, the cron starts at 5 AM UTC
[16:45:05] so I can deploy the class now to 1007, then disable the cron on 1005
[16:45:19] does it sound good? Will take me ~5 min
[16:45:34] sounds good
[16:46:06] gives me plenty of time to install the packages and try the scripts/queries to check whether it's safe to enable the cron on stat1007
[16:46:23] are you going to copy all the existing published-datasets files?
[16:46:50] yep I am syncing the srv dir but still wip (home dirs now)
[16:46:54] anything specific that you need?
[16:47:20] elukey: the entirety of /srv/published-datasets/discovery
[16:47:49] ah ok still not there, then let's resync next week for the move
[16:48:23] since when the cron runs it would use reportupdater and append to the existing datasets
[16:48:46] ah so it needs to be moved with report updater?
[16:48:51] still only on stat1005
[16:49:43] the discovery stuff is separate from AE's reportupdater stuff
[16:50:29] elukey: see https://github.com/wikimedia/wikimedia-discovery-golden/blob/master/main.sh
[16:50:31] ah ok I was mixing things up
[16:51:20] okok so you only need things rsynced
[16:51:44] yep! :)
[16:52:16] bearloga: all right I'll re-ping you on Monday then
[16:52:28] elukey: works for me! :D
[16:52:29] do you have a preferred time ?
[16:53:32] us east coast 10a-12p would be perfect for this
[16:54:17] ok so ~16 CEST
[16:54:22] seems fine to me, thanks a lot!
[16:55:45] elukey: thanks for being understanding that this migration is... involved
[16:59:47] Analytics, Operations, hardware-requests, User-Elukey: eqiad | (14 + 6) hadoop hardware refresh and expansion - https://phabricator.wikimedia.org/T199673 (Cmjohnson)
[16:59:50] Analytics, Operations, hardware-requests, User-Elukey: eqiad | (3) Labs Data Lake hardware - https://phabricator.wikimedia.org/T199674 (Cmjohnson)
[17:02:28] mforns: qq - what do we need to do to move report updater to stat1007? I mean, people to ping, etc..
[17:02:43] elukey, thinking
[17:04:26] so I can see
[17:04:27] reportupdater_reportupdater-queries-browser
[17:04:31] reportupdater_limn-language-data-interlanguage
[17:04:34] in the hdfs crontab
[17:04:53] aha only?
[17:04:59] there should be a lot more no?
[17:05:03] yeah I think the others are on stat1006
[17:05:11] oh! ok, makes sense
[17:05:19] 1005 only runs hive queries
[17:05:51] namely (on stat1006)
[17:05:52] # Puppet Name: reportupdater_limn-edit-data-edit-beta-features
[17:05:52] # Puppet Name: reportupdater_limn-language-data-language
[17:05:52] # Puppet Name: reportupdater_limn-flow-data-flow
[17:05:52] # Puppet Name: reportupdater_discovery-stats-interactive
[17:05:55] # Puppet Name: reportupdater_limn-ee-data-ee-beta-features
[17:05:57] # Puppet Name: reportupdater_limn-ee-data-ee
[17:06:00] # Puppet Name: reportupdater_limn-flow-data-flow-beta-features
[17:06:02] # Puppet Name: reportupdater_reportupdater-queries-page-creation
[17:06:05] # Puppet Name: reportupdater_reportupdater-queries-pingback
[17:06:07] # Puppet Name: reportupdater_limn-language-data-published_cx2_translations
[17:06:13] ok, so people to notify are just the interlanguage folks
[17:06:24] looking up who the owner is
[17:06:30] super thanks!
[17:10:28] elukey, it should be Amir and Kartik
[17:10:50] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) hey hey heyyy, the nodes are in! https://phabricator.wikimedia.org/T204177#4695147 How can we move this forwa...
[17:12:09] elukey, but I wouldn't say they look at that data from within stat1005
[17:12:18] rather they use their dashiki dashboard: language-reportcard.wmflabs.org
[17:12:39] does that one need to be updated when we move ?
[17:12:45] (I am super ignorant about this)
[17:13:32] elukey, the reports output by RU are rsync'ed to thorium, IIRC, and are requested from there by dashiki
[17:13:56] ah ack, so I need to verify that afterwards then
[17:13:58] so, as long as the rsync continues to copy RU reports over to former stat1001, I think all's good!
[17:14:12] super
[17:14:15] I'll take a note
[17:14:52] thanks mforns!
[17:15:03] team logging off for today, talk with you tomorrow!
[17:15:52] elukey, no problem, the reports live in /srv/reportupdater/output
[17:15:59] byeeeeee
[17:17:17] byeee
[17:33:24] dcausse: Hi! Are you still around or should I ask tomorrow?
[17:33:52] joal: in standup atm, in 15/20min?
[17:33:57] sure B!
[17:45:11] joal: I'm around
[17:45:18] yo dcausse :)
[17:45:22] yo! :)
[17:45:58] dcausse: I have noticed plenty of workflows in suspended state in oozie (mostly discovery-clicks)
[17:46:19] Is that expected?
[17:46:27] what is suspended?
[17:46:35] is it an action on our side?
[17:47:24] dcausse: The parent coord is https://hue.wikimedia.org/oozie/list_oozie_coordinator/0023681-180705103628398-oozie-oozi-C/
[17:47:34] query-clicks-hourly
[17:48:12] It is in running state, but from what I see in https://hue.wikimedia.org/oozie/list_oozie_workflows/, all the workflows it uses are in suspended state?
[17:48:14] I don't think that's expected...
[17:50:32] I have seen a similar pattern here https://hue.wikimedia.org/oozie/list_oozie_coordinator/0048633-180705103628398-oozie-oozi-C/
[17:52:43] if it fails too many times it enters this suspended state apparently
[17:53:09] can it be because it waits for too long on the webrequest and/or cirrus data?
[17:53:26] dcausse: failing too many times -- I think this is something I have not yet experienced with oozie :)
[17:54:25] ok dcausse - I think I see the problem
[17:58:52] ?
[18:06:25] dcausse: It was related to the an-coord1001 move we made a few days ago
[18:06:54] dcausse: We updated the /user/hive/hive-site.xml file and inadvertently changed its permissions, preventing the jobs from running
[18:07:28] dcausse: problem manually fixed, I'm currently rerunning all the suspended jobs, and ottomata is working on making sure we don't do that again :)
[18:13:24] !log Manually copy /etc/hive/conf/hive-site.xml to hdfs:///user/hive and set permissions to 644 to allow all users to run oozie jobs
[18:13:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:36] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (Cmjohnson)
[18:13:42] joal: thanks!!
[18:14:03] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (Cmjohnson)
[18:14:24] !log Manually resume the bunch of suspended jobs (mostly from ebernhardson and chelsyx - our apologies for not noticing earlier)
[18:14:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:55:52] Analytics, Analytics-Kanban, Operations, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (faidon) So, this is quite the can of worms :) There are several pieces to this, and honestly, I feel like VLANs is kind o...
[19:02:27] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install ca-worker100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (Ottomata) FYI, networking considerations being worked out in {T207321}
[19:09:19] Quarry: REPORTS-68 Implement dynamic cache duration - https://phabricator.wikimedia.org/T60826 (Framawiki) Open>Invalid Not applicable to Quarry.
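(The fix from the !log entry above, spelled out as commands - a sketch; it assumes the fix was applied as the hdfs superuser:)

    # Re-publish the server's hive-site.xml to HDFS and make it world-readable,
    # so every user's oozie launchers can read it again.
    sudo -u hdfs hdfs dfs -put -f /etc/hive/conf/hive-site.xml /user/hive/hive-site.xml
    sudo -u hdfs hdfs dfs -chmod 644 /user/hive/hive-site.xml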
[19:09:21] Analytics, Analytics-Wikistats: Restore WikiStats features disabled for mere performance reasons - https://phabricator.wikimedia.org/T44318 (Framawiki)
[19:12:17] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Framawiki) Can be incorporated into {T206482}
[19:13:13] Quarry: Add date when query was last run - https://phabricator.wikimedia.org/T77941 (Framawiki) Related: {T206482}
[19:16:06] Quarry, Documentation: admin docs: quarry - https://phabricator.wikimedia.org/T206710 (Framawiki) I was a bit trolling in my last comment of course. But ie. {T205150} is an essential task, that involves WMF people (join prod instance, labs one ?) and creating a monitoring tool is not in the scope of #qua...
[19:17:48] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Decide whether to use schema references in the schema registry - https://phabricator.wikimedia.org/T206824 (Ottomata) Hm a tricky bit about $refs and generating fully dereferenced schemas...
[19:22:18] Analytics, Analytics-Data-Quality, Contributors-Analysis, Product-Analytics, Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (nettrom_WMF) I've created [[ https://meta.wikimedia.org/wiki/Schema:Edit2 | Schema:Edit2 ]], an...
[19:38:44] Analytics, Analytics-Cluster, Analytics-Kanban, Contributors-Analysis, Product-Analytics: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 (JAllemandou) >>! In T206279#4695030, @Ottomata wrote: > Reopening because we have an idea: > > We will set `S...
[19:49:54] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Wurgl) I think there is no need to run such a query by cron or similar. There will surely be forgotten queries which you execute over and over again, but no one is looking at the results. So you waste CPU-time. With some cron-like mechani...
[19:52:44] ottomata, elukey: question: I see occasional messages (very rare, like one per 2 hours) in codfw.mediawiki.revision-create - is that normal? Why are they there?
[19:53:26] SMalyshev: what are the messages?
[19:53:33] do they look like revision-creates?
[19:53:45] yes
[19:53:52] is it possible an app server is creating a revision in codfw?
[19:54:03] ottomata: https://pastebin.com/HPXY6tC3
[19:54:28] but why?
[19:54:39] good q!
[19:54:52] these seem to be completely random edits, no different from others, why did they suddenly go to codfw?
[19:55:37] the eventbus proxy service is routed by eventbus.discovery.wmnet and lvs, maybe that resolves differently somewhere, or maybe there is some reason these edits are actually happening in codfw
[19:55:41] jobqueue maybe?
[19:55:47] we should probably ask petr in services
[19:56:23] no, doesn't look like jobqueue
[19:57:44] just one random edit per hour suddenly ends up in codfw
[20:13:20] got another one, from 2018-10-25T20:03:56+00:00
[20:13:34] looks like a once-per-hour thing
[20:15:08] maybe some monitoring type thing?
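(One way to inspect those stray events directly, following the kafkacat pattern used elsewhere in this log - the broker choice assumes the codfw.* topics are mirrored to the jumbo cluster:)

    # -o end tails only new events instead of replaying the topic from the beginning.
    kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t codfw.mediawiki.revision-create -o end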
[20:16:36] but these seem to be completely random edits from different users
[20:16:58] on different wikis even
[20:17:40] sometimes bots, sometimes humans
[20:18:43] https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&panelId=28&fullscreen&orgId=1
[20:18:52] it happens for other topics too
[20:19:15] recentchange, revision-score
[20:19:24] ah but revision-score is a reaction to revision-create
[20:19:30] so that will always happen at the same time
[20:19:44] but if it is in recentchange too, it does indeed look like mediawiki is sending the event
[20:19:46] revision-create is 0 there
[20:19:55] but it's not really 0
[20:19:58] naw
[20:20:00] it's not 0
[20:20:04] it spikes
[20:20:08] click on just the revision-create topic
[20:20:20] it looks like the behavior you are describing
[20:21:25] yes exactly
[20:21:31] so I wonder what's going on there?
[20:26:38] ottomata: do you know who might have any insight into this?
[20:28:01] SMalyshev: i'd ask in services and/or ops. this has to do mostly with mediawiki request routing
[20:28:15] ok, thanks
[20:28:18] sorry attention is slightly elsewhere atm
[20:43:18] Quarry: Recurring queries - https://phabricator.wikimedia.org/T101835 (Framawiki) » {T203791}
[20:56:15] hallo
[20:56:47] yesterday elukey showed me a way to look at incoming EventLogging events: `kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_ContentTranslationAbuseFilter -o beginning`
[20:58:08] the code that sends events to the ContentTranslationAbuseFilter schema was deployed earlier today.
[20:58:27] if I run kafkacat now, I see an event that I'd expect to be logged in the output.
[20:58:48] but I don't yet see it in MySQL
[20:59:20] (CR) Joal: [C: 1] "LGTM !" [analytics/refinery] - https://gerrit.wikimedia.org/r/467700 (owner: Milimetric)
[20:59:42] that is, I'd expect to run `use log;` and `show tables;`, and to see the ContentTranslationAbuseFilter table, but I don't see it
[20:59:54] will it be auto-created? when can I expect to see it?
[21:03:03] aharoni: I'm no expert in EL, but I think the system must be restarted to pick up the latest config (the one that includes your topic in the mysql whitelist)
[21:03:40] joal: ow... does that happen regularly? Or do I have to request it?
[21:03:59] aharoni: I would think it happens regularly - I'll ping Luca tomorrow about that
[21:04:02] aharoni: you have to request it
[21:04:16] we used to blacklist things from going to mysql, now we whitelist things
[21:04:16] Thanks for telling me wrong ottomata
[21:04:18] :)
[21:04:42] ottomata: it's whitelisted in puppet. So what's the process for requesting it?
[21:04:44] OH
[21:04:46] it is?
[21:04:49] sorry
[21:04:59] then ya i think we just gotta bounce the processor, checking...
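(Once the processors are bounced and a new valid event flows through, the table should be auto-created; a quick check one might run against the log database - the replica hostname here is a guess, and note that EL tables carry a _<schema-revision> suffix, hence the LIKE pattern:)

    # Hypothetical connection details; the log database holds one table per EL schema revision.
    mysql -h analytics-slave.eqiad.wmnet log -e "SHOW TABLES LIKE 'ContentTranslationAbuseFilter%';"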
[21:05:02] something else interesting here: no ContentTranslationAbuseFilter table in hive either
[21:05:10] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/
[21:05:11] ottomata: --^
[21:05:24] joal: yes, not in hive either
[21:05:25] ottomata: no log about the EL processors having been bumped
[21:05:29] great
[21:06:00] ya they weren't bounced
[21:06:01] doing now
[21:06:23] ottomata: I thought the whitelist was for Mysql only
[21:06:38] !log bouncing eventlogging-processor client side* to pick up mysql whitelist change for ContentTranslationAbuseFilter (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/)
[21:06:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:10:47] Analytics, EventBus, Operations, Wikidata, and 7 others: WDQS Updater ran into issue and stopped working - https://phabricator.wikimedia.org/T207817 (mobrovac)
[21:11:25] ottomata: joal - is it done now?
[21:11:32] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo)
[21:11:36] (or am I extremely impatient?)
[21:13:47] aharoni: yes it is, it takes minutes
[21:14:07] aharoni: did you test your schema in the beta cluster to make sure your events are valid before going to prod?
[21:14:26] aharoni: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster
[21:15:12] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) Could it have to do with automatic checks somehow?
[21:18:23] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) Nope. The events are all legitimate edits, just a random portion of them. Here's an example: ``` {"comment": "/* wbeditentity-update:0| */ Updat...
[21:20:50] aharoni: Just checked the logs of the hive refine process - the last hourly job had processed hour 17, and your schema has 2 events for hour 19
[21:21:50] aharoni: On the hive side (the one I kinda know better) I think your events will show up in, say, ~2 hours :)
[21:22:49] Gone for tonight team - See you tomorrow
[21:26:09] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) Hm interesting. These are all original events, so either the EventBus proxy service messes up the DNS (less likely) or somehow legitimate reques...
[21:44:51] Analytics, Multimedia: Add mediacounts to pageview API - https://phabricator.wikimedia.org/T88775 (jmatazzoni) If you're looking for a project that might use this API, the [[ https://phabricator.wikimedia.org/project/manage/3543/ | Event Metrics tool ]] would love to be able to get an accurate count for...
[21:45:42] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) Things I've checked so far: - There are no logs anywhere associated with these events. - All the events are legitimate edits, they exist in DB,...
[22:08:55] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (mobrovac) >>!
In T207994#4696066, @Pchelolo wrote: > Things I've checked so far: > - There are no logs anywhere associated with these events. > - All the...
[22:24:53] Analytics, EventBus, Services (later): revision-create events are sometimes emitted in a secondary DC - https://phabricator.wikimedia.org/T207994 (Pchelolo) We can keep theorizing about this, but we need more information before we could make any real theory. I think the first and foremost we need t...
[23:32:04] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Etonkovidova) I checked testwiki, cawiki (wmf.1), enwiki (wmf.26) and betalabs. testwiki, cawiki (wmf.1) Deletes 50 items...
[23:50:21] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Pchelolo) @Etonkovidova sorry I didn't update this ticket.. We've had an outage caused by the fix to it so it was reverted t...
[23:53:10] Analytics, EventBus, MediaWiki-Watchlist, WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (Etonkovidova) >>! In T207329#4696235, @Pchelolo wrote: > @Etonkovidova sorry I didn't update this ticket.. We've had an outa...