[05:15:19] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Performance tweaks for state management in wikistats - https://phabricator.wikimedia.org/T207352 (10Nuria)
[05:29:48] (03PS1) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352)
[05:33:06] (03CR) 10jerkins-bot: [V: 04-1] Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)
[06:56:54] (03CR) 10Elukey: [V: 032 C: 032] Add StringMessageDecoder to the list of kafka coders [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468016 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey)
[07:00:52] the hdfs balancer is running now via systemd timer! \o/
[07:01:07] and it is also logging fine
[07:01:11] so it works
[07:07:09] (03PS1) 10Elukey: Release 0.1.0-wmf9 [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468224
[07:09:22] \o/ ! Thanks elukey :)
[07:11:04] joal: I am going to send a code review for sqoop in a bit :)
[07:11:09] and then camus is next
[07:11:27] (that is the tricky one in puppet)
[07:11:31] joal: bonjourrrrr
[07:11:38] do you have a minute for a pom.xml question?
[07:11:39] Bonjour elukey :)
[07:11:51] Thanks a million for those moves - this will help a lot, I'm sure!
[07:11:53] elukey: sure
[07:12:11] "I love it when alarming comes together" :P
[07:12:17] hehe :)
[07:12:40] so I added a new coder to camus (wmf branch) for the eventlogging-client-side use case
[07:13:30] and now I am trying to figure out how to make a change to the pom.xml(s) to update the version number and then possibly:
[07:13:33] 1) upload to archiva
[07:13:46] 2) make the change to refinery source
[07:15:02] elukey: at least one thing to discuss here - is your patch on the camus repo, or on the refinery-source one?
[07:15:59] camus repo, wmf branch
[07:16:10] https://gerrit.wikimedia.org/r/468016
[07:16:28] Ok - why not refinery-camus?
[07:16:42] elukey: there is a coders folder in there too
[07:17:24] I had no idea, I thought that we imported camus via our repository
[07:17:35] I can send another patch for refinery-camus
[07:17:44] (03Abandoned) 10Elukey: Release 0.1.0-wmf9 [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468224 (owner: 10Elukey)
[07:18:17] elukey: My understanding of it (which is for sure partial): the camus repo is for "core camus", and the code used by camus (coders, schemas etc) is in refinery-camus
[07:18:23] But I might be completely wrong
[07:19:20] mmm so ./src/main/java/org/wikimedia/analytics/refinery/camus/coders doesn't include the JsonStringMessageDecoder though
[07:19:44] that we use for all the camus configs
[07:21:36] elukey: example - The camus script we use to launch jobs has a --libjars option, allowing us to pass it additional jars (for code) - The only cron using it is the mediawiki-api data one, for the avro code
[07:22:05] This makes me assume that the JsonDecoder is included in the camus repo by default, but it could have been included through refinery-camus
[07:22:26] Now, adding your code to refinery-camus also means adding the libjars to the script
[07:22:55] I have no strong opinion - I just prefer to touch refinery-source only when feasible (I'm afraid of camus failures)
[07:23:36] well this is a new coder, it will only be used by the camus jobs that explicitly set it in the config, it cannot break other things
[07:23:41] If you go for a camus update, then the process is as you described: modify the pom manually, build locally, upload to archiva (jar + pom)
[07:24:52] but in that case, how should the pom be modified? only bumping the version of the coders and the camus overall?
[07:27:21] elukey: I think we would bump the version number in every subproject (as of now it is coherent)
[07:28:26] ah ok, all to -wmf9
[07:28:32] -wmf-8 no?
[07:28:42] elukey: https://github.com/wikimedia/analytics-camus/commit/fe75f1b0577d6e0ec023462c46ba41e3faea4373
[07:29:46] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is CRITICAL: CRITICAL
[07:30:05] ahhh andrew already bumped the version after wmf7!
[07:30:06] https://github.com/wikimedia/analytics-camus/commit/0b46f051260af41207a499b61c4159ace96379b9
[07:30:09] nice
[07:30:25] now it makes sense, I was seeing wmf7 in refinery's pom.xml and I was confused
[07:30:34] so I don't have to touch anything
[07:32:10] hm - You should move 8-SNAPSHOT to 8, and once deployed, make it 9-SNAPSHOT, no?
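
A minimal sketch of the manual wmf-release flow joal outlines above, assuming the versions-maven-plugin is available (hand-editing the pom.xml files in each subproject works just as well):

```bash
# Manual wmf release flow (sketch): drop -SNAPSHOT, release, reopen development.
cd camus && git checkout wmf

# 1) move 0.1.0-wmf8-SNAPSHOT -> 0.1.0-wmf8 in every subproject pom
mvn versions:set -DnewVersion=0.1.0-wmf8 -DgenerateBackupPoms=false

# 2) build and upload jar + pom to archiva (assumes deploy credentials in settings.xml)
mvn clean deploy

# 3) reopen development on the next version
mvn versions:set -DnewVersion=0.1.0-wmf9-SNAPSHOT -DgenerateBackupPoms=false
```
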
[07:32:35] no idea about the convention
[07:33:02] ahhh ok I am getting it
[07:33:04] sorry I am slow
[07:33:08] np :)
[07:33:28] I need to remove the SNAPSHOT suffix now, release, and add the 9-SNAPSHOT
[07:33:39] elukey: those manual steps are not so nice - This is another reason why I prefer to add code to refinery instead of camus if possible :)
[07:33:48] sounds correct to me elukey
[07:34:49] I am going to wait for Andrew then, yesterday we discussed releasing but it sounds like we need an agreement first :)
[07:35:01] I just want to get eventlogging-client-side working :P
[07:35:45] I'm all with you on that side - If you prefer to go for a camus update as I said, I'm fine with it :) The process is manual but should be ok :)
[07:41:23] (03PS1) 10Elukey: TestStringMessageDecoder: don't rely on time being monotonic [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468246
[07:42:39] (03CR) 10Elukey: [V: 032 C: 032] TestStringMessageDecoder: don't rely on time being monotonic [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468246 (owner: 10Elukey)
[07:46:27] (03PS1) 10Elukey: Release 0.1.0-wmf8 [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468249
[07:46:42] I'll leave this in here --^
[07:47:35] (03CR) 10Elukey: "~/WikimediaSource/camus git grep SNAPSHOT" [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468249 (owner: 10Elukey)
[07:48:31] elukey: if you want we can move :)
[07:51:27] elukey: my comments were more about my view on simplicity than anything else - If you're ready to deploy a new version of camus, let's move :)
[07:52:39] joal: the only thing missing would be a bit of testing to see if it effectively solves the issue.. the unit test seems to work but real world data is different :)
[07:53:00] elukey: a manually built jar should do, I think
[07:54:27] joal: and then copy it to an-coord1001 and run the thing manually?
[07:54:44] elukey: simplest IMO would be to build the jar on the stat machine
[07:55:08] elukey: clone repo, apply patch, build, and run
[07:56:23] ack I'll try
[07:56:33] elukey: currently doing it
[07:56:34] in the meantime: https://gerrit.wikimedia.org/r/468251
[07:56:35] :)
[07:57:05] elukey: your patch for the version doesn't seem to be dependent on the code patch :)
[07:57:20] what do you mean?
[07:57:41] hm - The real code patch on the decoder - has it already been merged?
[07:58:00] It seems it has :)
[07:58:08] same with the test - ok my bad :)
[07:58:09] sorry
[07:58:43] yep yep
[07:59:55] When I pull the wmf branch over master, I need a merge-commit elukey :(
[08:00:57] ah yes I know, Andrew told me to work on wmf only, since in theory master should track upstream
[08:01:08] Ah !
[08:01:14] ok :)
[08:01:22] camus built elukey :)
[08:01:45] what maven command did you use? I tested mvn test
[08:01:50] (curious and ignorant)
[08:02:03] elukey: on stat1004:/home/joal/code/camus/camus-wmf/target/camus-wmf-0.1.0-wmf8-SNAPSHOT.jar
[08:02:08] mvn clean package
[08:02:46] clean first (forces a full recompile), then package (generates the jar)
[08:03:30] clean package, nice
[08:03:54] elukey: two commands in a row for maven means command 1 first, then command 2
[08:04:19] before testing camus, do you mind reviewing with me https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468251/2/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp ?
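
joal's "clone repo, apply patch, build, and run" recipe, spelled out as a sketch (the gerrit refspec and patchset number are illustrative; the real ones come from the change's download box):

```bash
# Build a test jar on a stat machine (sketch)
git clone https://gerrit.wikimedia.org/r/analytics/camus
cd camus && git checkout wmf
# pull in the change under test (refspec illustrative)
git fetch https://gerrit.wikimedia.org/r/analytics/camus refs/changes/16/468016/1 && \
  git cherry-pick FETCH_HEAD
mvn clean package   # clean: force a full recompile; package: build the jars
ls camus-wmf/target/camus-wmf-*.jar
```
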
[08:04:28] I mean the commands and the timings
[08:04:43] reading
[08:05:20] the only thing that I don't like now is that in the systemd unit there will be a password in cleartext
[08:05:46] no passwd in any of those commands elukey - everything through files
[08:06:09] -p ${db_password_private} ?
[08:06:21] ah yes a file
[08:06:22] gooood
[08:06:31] We should rename those variables :)
[08:09:05] 10Analytics, 10Analytics-Kanban, 10Cleanup, 10GitHub-Mirrors: Delete wikimedia/kraken repository from github - https://phabricator.wikimedia.org/T207204 (10hashar) 05Open>03Resolved https://github.com/wikimedia/kraken is now marked as archived in Github. It has a few issues and pull requests so I guess...
[08:09:08] Things I don't know about puppet/cron/systemd escaping elukey: do we need to escape the % in date sub-commands as we did for cron?
[08:09:46] This is a problem I always run into when getting commands from cron to run manually: the % is escaped for cron in date sub-commands
[08:10:03] ah good point!
[08:10:08] $(/bin/date +\\%Y-\\%m-15)
[08:10:10] I have no idea :D
[08:10:21] Shall that be: $(/bin/date +%Y-%m-15) in systemd?
[08:11:23] in theory yes
[08:11:44] Also something I'd like an explanation on: in systemd_env, we escape the $ sign to prevent systemd from trying to interpret it as a variable
[08:12:07] However, in the cron command, we don't escape the $ for subcommands ??
[08:12:23] * joal is lost in escapation
[08:12:30] https://stackoverflow.com/questions/45999187/creating-filename-date-y-m-d-from-systemd-bash-inline-script
[08:13:13] To pass a literal dollar sign, use "$$"
[08:13:16] Wow
[08:13:45] elukey: all the timers you created before didn't make use of the date command? I'm surprised :)
[08:14:04] I think they didn't but of course now I need to check
[08:14:56] yep, seems sqoop is the only one
[08:15:26] elukey: do you mind sharing the command you use to check for timers (I need to learn it, but so far I don't recall)
[08:15:41] systemd list-timers
[08:15:47] err systemctl list-timers
[08:15:50] easy enough :)
[08:16:24] \0/ !
[08:16:27] (the list of timers is amazing)
[08:22:01] elukey: I have questionz!
[08:22:25] sure!
[08:22:41] elukey: I have found the timer descriptions on an-coord1001
[08:23:03] elukey: They reference a service - Where can I find the services?
[08:24:27] * joal apologizes for ignorance in systemd :(
[08:25:39] nono please ask questions :)
[08:26:08] so if you grab the name.timer, then you can check with systemctl cat name.service
[08:26:25] or systemctl status name.service to check the last logs and return code
[08:27:08] right - I was more wondering about how things are set up internally - timer files reference a service, and I don't know how this service is defined :(
[08:27:47] Ah! the cat command told me :)
[08:27:56] Solved problem elukey :)
[08:29:04] elukey: I however wonder why the file doesn't show up with ls :(
[08:30:24] joal: what file do you mean?
[08:30:58] elukey: batcave for a minute, might be easier :)
[08:31:18] ?
[08:31:33] coming!
[08:45:23] 10Analytics, 10Analytics-Kanban, 10Cleanup, 10GitHub-Mirrors: Delete wikimedia/kraken repository from github - https://phabricator.wikimedia.org/T207204 (10Dinoguy1000) >>! In T207204#4676624, @hashar wrote: > It has a few issues and pull requests so I guess it should not be deleted. Normal procedure for...
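
The escaping rules being puzzled out above, plus the inspection commands, in one place (the unit name below is hypothetical):

```bash
# cron: % is special inside a crontab entry, so date sub-commands need \% :
#   $(/bin/date +\%Y-\%m-15)
# systemd: % is a unit-file specifier and is escaped by doubling, and a
# literal $ is written $$, so the same sub-command in an ExecStart= line
# (wrapped in /bin/bash -c) becomes:
#   $(/bin/date +%%Y-%%m-15)

# Inspecting timers and the services they reference:
systemctl list-timers                                # all timers + next/last run
systemctl cat refinery-sqoop-mediawiki.service       # show the unit definition
systemctl status refinery-sqoop-mediawiki.service    # last logs and return code
```
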
[09:04:49] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Refactor analytics cronjobs to alarm on failure reliably - https://phabricator.wikimedia.org/T172532 (10elukey)
[10:31:13] heya team
[10:32:16] o/
[10:34:30] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 4 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10mobrovac)
[11:04:50] joal: in the end I created two new files to reference in the units
[11:05:03] life is too short to battle with escaping
[11:05:52] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468251/
[11:07:49] (03CR) 10Mforns: [V: 032 C: 032] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/467440 (https://phabricator.wikimedia.org/T136851) (owner: 10Nuria)
[11:13:15] elukey: that puppet patch still needs the refinery code to go through, so nothing to do yet :)
[11:17:31] fdans: yep yep but even after that my refactoring will make it useless, sorry :( I'll add the change after merging mine! (and after refinery's code is deployed)
[11:18:06] how dare you elukey
[11:19:21] ahahha
[11:27:27] (03CR) 10Mforns: [C: 031] "Code looks good! But some indentation typos." (034 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)
[11:42:04] elukey: about the refactor - Shouldn't we have the same erb template for both sqoop and sqoop-private?
[12:03:34] joal: the commands are different, it might be cumbersome in puppet to keep the same template and render them differently
[12:05:01] I can recheck and try to factorize a bit
[12:08:17] if possible I'd keep them this way
[12:33:30] no problem elukey :)
[12:36:47] going afk now team, be back in ~2h :)
[12:38:49] (03PS1) 10Fdans: Add mediawiki_history_reduced to list of tables to drop snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888)
[12:40:19] (03CR) 10Fdans: [V: 031] "Tested with dry run, listed correctly partitions to drop." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans)
[13:17:32] (03CR) 10Ottomata: [C: 031] Release 0.1.0-wmf8 [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468249 (owner: 10Elukey)
[13:29:21] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 4 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Ottomata) Just checked, python jsonschema validates milliseconds ISO-8601s with date-time format just fine.
:)
[13:31:17] (03CR) 10Ottomata: [C: 031] Add mediawiki_history_reduced to list of tables to drop snapshots [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans)
[14:03:53] (03PS5) 10Joal: Add webrequest_subset_tags transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465206 (https://phabricator.wikimedia.org/T164020)
[14:04:51] (03PS6) 10Joal: Add webrequest_subset_tags transform function [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/465206 (https://phabricator.wikimedia.org/T164020)
[14:07:05] (03PS1) 10Joal: Add WebrequestSubsetPartitioner spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468322 (https://phabricator.wikimedia.org/T164020)
[14:09:46] joal: maybe you want to use Refine + RefineTarget instead of manually passing in year month day hour?
[14:09:58] otherwise you'll run into the same problems marcel is having with el2druid?
[14:10:02] unless oh i guess you are going to oozie it
[14:10:04] ...
[14:10:05] duh
[14:10:06] :)
[14:10:07] nm.
[15:25:56] fdans: need to join 1 on 1 5 minutes late
[15:26:07] no probs!
[15:58:35] ottomata: Heya - I actually double checked RefineTarget, but PartitionedDataFrame with dynamic-partitions doesn't do well with them
[16:01:04] a-team, I'm attending to the house's maintenance electrician, will be late to stand-up, sorry!
[16:01:30] a-team: standdupp ping ottomata milimetric
[16:03:55] AHHH
[16:04:03] OHHH :)
[16:05:07] UHHHH
[16:14:26] (03CR) 10Milimetric: Set the active filter correctly on breakdowns mount (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468027 (https://phabricator.wikimedia.org/T206822) (owner: 10Fdans)
[16:15:59] (03CR) 10Milimetric: [C: 031] "my comments are fully addressed, thanks" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/465416 (owner: 10Fdans)
[16:22:09] (03CR) 10Milimetric: Memoizing results of state functions (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)
[16:32:22] joal: I am running camus on my /user/elukey dir on HDFS with your jar, let's see if it works :)
[16:32:31] \o/
[16:33:46] estimated_size:133245641728
[16:33:49] a-team: I have a SWAP notebook that keeps crashing while running a long query...I suspect that it's running out of memory, but the query's not *that* big—how could I investigate that further?
[16:34:03] or rather, the output is not *that* big
[16:34:32] neilpquinn: o/ - what notebook? 1003 or 1004?
[16:34:40] neilpquinn: 2 solutions, no user-friendly one :(
[16:34:41] 1003
[16:34:52] neilpquinn: I'd run the query manually in hive and check
[16:35:18] neilpquinn: Or I'd monitor yarn for the query, get the application-id and look at the logs (even worse)
[16:35:33] joal: but I think the issue is that loading it all into memory in the Python kernel consumes too much memory...so YARN or hive wouldn't know anything about that
[16:35:41] yeah I was about to ask what you mean by "crashing"
[16:35:43] Ah !
[16:35:52] My bad neilpquinn - Indeed, if the data is too big for python, well, there's not much I can do :)
[16:36:10] joal, elukey: yeah, the notebook itself restarts...when I open it up in the morning, all the variables are gone
[16:36:32] ah wait, so it wasn't just now that it happened, right?
[16:36:39] Because this morning I recall this (Eu time)
[16:36:40] [Thu Oct 18 07:27:00 2018] Out of memory: Kill process 4875 (python3) score 221 or sacrifice child
[16:37:06] neilpquinn: now that I get that, can you be more precise than "not that big"?
[16:37:06] the kernel's OOM killer killed one python3 process
[16:37:09] elukey: I ran this query once about 24 hours ago and once about 10 hours ago
[16:37:24] yeah so that kill was your process :)
[16:38:13] elukey: okay, thanks, that helps—I guess it would have 10–100 million rows so...maybe that big actually :p
[16:39:23] elukey: so how can I get the feedback that the kernel was killed because of lack of memory? I only figured out this was the issue after I ran it a second time
[16:39:55] neilpquinn: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=notebook1003&var-datasource=eqiad%20prometheus%2Fops&from=1539841199230&to=1539852249252
[16:40:20] so the baseline seems high
[16:40:25] (used memory I mean)
[16:40:31] and your process crossed it
[16:40:40] that was around 14G afaics
[16:40:49] (from the kernel logs)
[16:41:10] elukey@notebook1003:~$ free -m
                  total     used     free   shared  buff/cache  available
[16:41:13] Mem:   64338    51284    11206      582        1846      11930
[16:41:16] Swap:    953      953        0
[16:41:18] mmmm
[16:41:54] so for example, there are some big processes running under "otto" in top :P
[16:41:57] java processes
[16:43:57] neilpquinn: so yeah there are also other notebooks taking a bit of memory space, and we have 64G total
[16:44:10] interesting !
[16:44:34] elukey: yes, interesting
[16:45:04] neilpquinn: for multi-Gb data size, I suggest you use spark :)
[16:46:10] joal, elukey: yeah, using Spark seems reasonable...it would just be helpful to get some direct feedback that memory limits are the issue. Would it be difficult to have the OOM killer send an email when it kills a kernel?
[16:47:03] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: "Anonymous Editor" is a broken link - https://phabricator.wikimedia.org/T206968 (10Nuria) a:05JAllemandou>03fdans
[16:47:43] neilpquinn: yeah, in theory it is not that configurable :(
[16:47:51] (I am checking the current usage with ps aux --sort -rss | head)
[16:49:53] 10Analytics: [EL2Druid] Make RefineTarget compatible with Druid and use it from EventLoggingToDruid - https://phabricator.wikimedia.org/T207207 (10fdans) p:05Triage>03Normal
[16:50:50] 10Analytics, 10Tool-Pageviews: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10fdans) p:05Triage>03Low
[16:52:16] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Performance tweaks for state management in wikistats - https://phabricator.wikimedia.org/T207352 (10fdans) p:05Triage>03High
[16:56:30] 10Analytics: Serve legacy code only to legacy browsers - https://phabricator.wikimedia.org/T207311 (10fdans) p:05Triage>03Normal
[16:58:12] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10fdans)
[16:58:25] ah joal I, of course, started the job calling it "eventlogging-client-side" -.-
[16:58:29] killing...
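
For the "how do I get feedback that the kernel was OOM-killed" question, a sketch combining the commands used above with a kernel-log check (the pyspark2 flag values are illustrative):

```bash
# Did the OOM killer fire? (the quoted "Out of memory: Kill process" line
# above comes from the kernel log)
sudo dmesg -T | grep -i 'out of memory'

# Current memory pressure and the biggest resident processes:
free -m
ps aux --sort -rss | head

# For multi-GB result sets, keep the work in the cluster rather than in the
# notebook's python kernel:
pyspark2 --master yarn --driver-memory 2G --executor-memory 4G
```
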
[16:58:35] MWHAHAHA :)
[16:58:55] inceptions elukey - Stuff already in the mind stays there
[17:01:13] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1003 is OK: OK
[17:07:28] 10Analytics: Research: add participant list to some of AQS edit api operations - https://phabricator.wikimedia.org/T206137 (10Milimetric) p:05Low>03Normal
[17:07:30] 10Analytics: Research: add participant list to some of AQS edit api operations - https://phabricator.wikimedia.org/T206137 (10Milimetric) p:05Normal>03Low
[17:09:21] elukey: your camus job is running :)
[17:10:02] 10Analytics: Measure portal pageviews - https://phabricator.wikimedia.org/T162618 (10Milimetric) @mpopov if this is still important, there's an easy way to do this now with tags. We put it on deprioritized for now, let us know if we should focus on it.
[17:10:43] 10Analytics, 10Research: wdqs_extract jobs should probably be stopped and deleted - https://phabricator.wikimedia.org/T191037 (10Milimetric) 05Open>03Resolved a:03Milimetric done by @elukey
[17:11:26] 10Analytics, 10goodfirstbug: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (10Milimetric)
[17:11:50] 10Analytics, 10goodfirstbug: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (10Milimetric) p:05Low>03Normal
[17:11:55] 10Analytics, 10goodfirstbug: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171 (10Milimetric)
[17:13:07] 10Analytics, 10Analytics-EventLogging, 10Performance-Team (Radar), 10Readers-Web-Backlog (Tracking): Make it easier to enable EventLogging's debug mode - https://phabricator.wikimedia.org/T188640 (10Milimetric) p:05Low>03High
[17:13:25] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Performance-Team (Radar), 10Readers-Web-Backlog (Tracking): Make it easier to enable EventLogging's debug mode - https://phabricator.wikimedia.org/T188640 (10Milimetric) a:03Milimetric
[17:14:44] 10Analytics: Consider changing EventLogging to encode events using base64 instead of uriEncode - https://phabricator.wikimedia.org/T199148 (10Milimetric) 05Open>03declined in favor of letting Modern Event Platform handle this
[17:17:19] 10Analytics, 10Analytics-Kanban, 10Page-Issue-Warnings, 10Product-Analytics, and 3 others: Ingest data from PageIssues EventLogging schema into Druid - https://phabricator.wikimedia.org/T202751 (10Milimetric) p:05Normal>03High
[17:17:29] 10Analytics, 10Analytics-Kanban: Remove sessionId, pageId pairs from whitelist - https://phabricator.wikimedia.org/T205458 (10Milimetric) p:05Normal>03High
[17:17:35] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Parametize eventlogging to druid ingestion with a whitelist instead of a blacklist - https://phabricator.wikimedia.org/T206342 (10Milimetric) p:05Normal>03High
[17:17:40] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking): Ingest data into druid for readingDepth schema - https://phabricator.wikimedia.org/T205562 (10Milimetric) p:05Normal>03High
[17:17:51] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10Performance-Team (Radar): Explore NavigationTiming by faceted properties - EventLogging refine - https://phabricator.wikimedia.org/T166414 (10Milimetric) p:05Normal>03High
[17:17:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Create .deb package for Presto - https://phabricator.wikimedia.org/T203115 (10Milimetric) p:05Normal>03High
[17:18:41] 10Analytics: Deprecate reportcard: https://analytics.wikimedia.org/dashboards/reportcard/ - https://phabricator.wikimedia.org/T203128 (10Milimetric)
[17:18:59] 10Analytics: Deprecate reportcard: https://analytics.wikimedia.org/dashboards/reportcard/ - https://phabricator.wikimedia.org/T203128 (10Milimetric) p:05Normal>03High
[17:21:31] 10Analytics, 10Analytics-Kanban: Enable automatic ingestion from eventlogging into druid for some schemas - https://phabricator.wikimedia.org/T190855 (10Milimetric) p:05Normal>03High
[17:22:24] 10Analytics, 10Analytics-Kanban: reportupdater TLC - https://phabricator.wikimedia.org/T193167 (10Milimetric)
[17:23:34] 10Analytics: [EL sanitization] Make WhitelistSanitization support arrays of structs, maps or other arrays - https://phabricator.wikimedia.org/T199230 (10Milimetric) p:05Normal>03Low
[17:25:51] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 (10Milimetric) p:05Normal>03High
[17:28:59] 10Analytics-Tech-community-metrics: Enable Discourse backend in wikimedia.biterg.io once discourse.mediawiki.org gets into production - https://phabricator.wikimedia.org/T186513 (10Aklapper) 05Open>03stalled p:05Low>03Lowest
[17:33:34] 10Analytics: Sqoop more tables for mediawiki in monthly schedule - https://phabricator.wikimedia.org/T198983 (10Milimetric) This adversely affects mediawiki-history because it increases the load on the databases and might upset the DBAs. Let's reprioritize it later when we have a better system for importing med...
[17:33:42] 10Analytics: Sqoop more tables for mediawiki in monthly schedule - https://phabricator.wikimedia.org/T198983 (10Milimetric) p:05Normal>03Low
[17:36:38] 10Analytics, 10Analytics-Kanban: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users - https://phabricator.wikimedia.org/T204950 (10Milimetric) p:05Normal>03High
[17:37:41] 10Analytics: AQS edits API should not allow queries without time bounds - https://phabricator.wikimedia.org/T189623 (10Milimetric) p:05Normal>03High
[17:38:55] 10Analytics, 10Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (10Milimetric)
[17:39:03] 10Analytics, 10Research: Migrate pagecounts-ez generation to hadoop - https://phabricator.wikimedia.org/T192474 (10Milimetric) p:05High>03Normal
[17:39:33] 10Analytics: Generate pagecounts-ez data back to 2008 - https://phabricator.wikimedia.org/T188041 (10Milimetric) p:05High>03Normal
[17:47:02] ottomata: a query like: select * from eventerror where year=2018 and month=10 and hour in (1,2,3,4,5) and event.schema like '%central%' limit 10000 ;
[17:47:10] ottomata: gives a ton of parquet errors
[17:47:30] https://www.irccloud.com/pastebin/fhn0MzMB/
[17:48:28] i think just warning/info nuria
[17:48:30] do you get results?
[17:49:48] ottomata: no, no results
[17:52:20] looking
[17:53:28] ottomata: i think there are no results, that is fine, it's just that the warnings are a bit confusing
[17:53:57] nuria: ya
[17:54:00] select * from eventerror where year=2018 and month=10 and day=18 and hour=1 limit 3; gives me results
[17:54:04] but day=17 does not
[17:55:01] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) We have looked into whether the source of discrepancy could be invalid events and...
[18:00:12] joal, ottomata - so the new camus job finished (results in /user/elukey/camus) but I can see some errors like
[18:00:16] Topic eventlogging-client-side:6 not fully pulled, max task time reached at 2018-10-18T17:58:05.167Z, pulled 43112573 records
[18:00:44] I think that it was trying to pull a week's worth of data
[18:00:46] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) Other possible cause could be that not all impressions are being sent by your EL c...
[18:00:48] elukey: this means that there was too much data to pull (which seems normal, it started from 7 days ago)
[18:01:13] so it works! But we'll need to think about the first run then?
[18:01:59] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) a:05mforns>03JAllemandou
[18:02:18] (maybe tuning timeouts)
[18:02:34] ya elukey that is correct
[18:02:39] (03CR) 10Elukey: [V: 032 C: 032] Release 0.1.0-wmf8 [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468249 (owner: 10Elukey)
[18:02:46] \o/ ! thanks a lot elukey :)
[18:02:46] but since that job only has a single topic
[18:02:50] you can remove the timeout
[18:03:06] the only reason we have a timeout is so a long/large topic task doesn't block others from running
[18:03:09] so we can reschedule more often
[18:03:15] ahhh snap there is a timeout!
[18:03:21] ya in the .properties file
[18:03:24] kafka.client.so.timeout=60000
[18:03:27] hmmm no not that one
[18:03:47] there is a #kafka.fetch.request.max.wait=
[18:03:56] ah no
[18:03:56] kafka.max.pull.minutes.per.task=55
[18:03:57] kafka.max.pull.minutes.per.task=9
[18:04:02] super
[18:04:07] that one
[18:04:07] yup
[18:04:21] shall we release it??
[18:04:21] ottomata, you already +1d it, but do you think this can be merged? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465692/ If it fails, at least it does not break anything existing already, as it is a new job
[18:05:13] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) Ping @AndyRussG regarding the question: "Are these banners shown to ALL browsers o...
[18:06:09] 10Analytics, 10Product-Analytics: Sqoop more tables for mediawiki in monthly schedule - https://phabricator.wikimedia.org/T198983 (10Neil_P._Quinn_WMF)
[18:07:14] test
[18:12:29] elukey: yes
[18:12:31] mforns: yes!
[18:12:46] :D
[18:12:54] running pc for yours mforns
[18:13:28] ottomata: no idea how :D
[18:13:40] I only got to mvn clean package (thanks to Joseph)
[18:14:52] mforns: looks about right: https://puppet-compiler.wmflabs.org/compiler1002/13078/an-coord1001.eqiad.wmnet/
[18:14:55] going to merge!
[18:14:59] elukey: with you 2 mins
[18:16:05] ack!
[18:17:10] elukey: i think mvn release is configured for this
[18:17:10] mind if I try? i don't know if it will work
[18:17:47] mforns: jobs installed on an-coord1001
[18:18:03] gonna manually run one of them..
[18:18:11] oh yes!
[18:18:21] please do :)
[18:20:33] elukey i think that maybe your manual commit of the new version in poms might not have been necessary....
checking
[18:20:41] right now i'm doing mvn release:prepare
[18:20:52] which does that stuff
[18:20:52] https://github.com/wikimedia/analytics-refinery-source#releases
[18:20:53] thanks ottomata !
[18:20:53] (following that atm)
[18:21:26] mforns: java.lang.ClassNotFoundException: org.wikimedia.analytics.refinery.job.refine.EventLoggingToDruid
[18:21:26] hm
[18:21:47] what?
[18:22:02] ottomata: I think you did the same for the past release no? (looking at the git commits)
[18:22:13] maybe! but they might have been done automatically by mvn
[18:22:17] don't really remember!
[18:23:29] ottomata, it's outside of refine... :/
[18:24:11] ooop
[18:25:01] elukey: i guess that's ok? it doesn't use refine directly right?
[18:25:11] just change the reference in puppet?
[18:25:17] or do you think it should go in refine/ ?
[18:26:33] oops
[18:26:44] sorry, that was supposed to be an mforns ping, not elu key
[18:27:03] no no ottomata, I'm just writing the quickfix
[18:27:16] it's just removing the '.refine'
[18:27:18] k
[18:27:19] coo
[18:28:03] (going to dinner, brb!)
[18:29:31] ottomata, https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468374/
[18:29:55] ottomata, why it executed so fast I don't understand...
[18:30:13] it's not 20:00, rather 20:30
[18:30:23] it should have triggered at the full hour...
[18:35:50] elukey: it's working, am updating docs in readme
[18:35:58] mforns: i did it manually
[18:36:03] oh! ok
[18:36:14] elukey: yeehaw
[18:36:15] https://archiva.wikimedia.org/repository/snapshots/org/wikimedia/analytics/camus-wmf/0.1.0-wmf8-SNAPSHOT/
[18:36:47] trying again mforns :)
[18:36:52] yea
[18:37:18] (03PS1) 10Ottomata: Update README.md with wmf deploy instructions and skip javadoc during release [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468378
[18:38:03] (03PS2) 10Ottomata: Update README.md with wmf deploy instructions and skip javadoc during release [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468378
[18:38:06] elukey: ^ :)
[18:38:22] (03CR) 10Ottomata: [V: 032 C: 032] Update README.md with wmf deploy instructions and skip javadoc during release [analytics/camus] (wmf) - 10https://gerrit.wikimedia.org/r/468378 (owner: 10Ottomata)
[18:38:47] should be good to go, now you just gotta update artifacts in refinery
[18:39:38] mforns: nice!
[18:39:38] 18/10/18 18:38:07 INFO DataFrameToDruid: Druid ingestion task index_hadoop_event_navigationtiming_2018-10-18T18:37:07.303Z for event_navigationtiming succeeded.
[18:39:39] 18/10/18 18:38:07 INFO EventLoggingToDruid: Done.
[18:39:42] ottomata: thankkkkss!!!
[18:39:52] Woohooooo!
[18:40:04] checking Turnilo
[18:41:14] ottomata: should it be 0.1.0-wmf8-SNAPSHOT or simply 0.1.0-wmf8 ?
[18:42:53] 0.1.0-wmf8
[18:42:57] oh
[18:43:04] why oh that's the snapshot
[18:43:09] looking
[18:43:52] hm
[18:43:53] why
[18:43:55] grrr
[18:43:57] mmmm sth is weird
[18:44:58] (03PS1) 10Elukey: Upgrade camus-wmf dependency to camus-wmf8 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381
[18:47:00] elukey: i'm doing another release i think maybe something went weird...
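
The plugin-driven flow ottomata is running above, per the refinery-source README (a sketch; both goals prompt interactively for the versions):

```bash
mvn release:prepare   # set the release version in the poms, commit, tag,
                      # then bump to the next -SNAPSHOT
mvn release:perform   # check out the tag, build, and deploy jar + pom to archiva
```
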
[18:48:28] sorry for the extra boring work :(
[18:48:30] thanks a lot
[18:50:55] (03CR) 10jerkins-bot: [V: 04-1] Upgrade camus-wmf dependency to camus-wmf8 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381 (owner: 10Elukey)
[18:53:01] ah yes it doesn't find the new dep in archiva
[18:53:12] i think it's working this time elukey
[18:53:29] (03CR) 10Elukey: "recheck" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381 (owner: 10Elukey)
[18:53:50] hang on the release is changing
[18:54:06] ahhh
[18:55:54] there we go
[18:55:55] https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/camus-wmf/0.1.0-wmf9/
[18:56:06] elukey: ^
[18:56:15] ah 9?
[18:57:24] (03PS2) 10Elukey: Upgrade camus-wmf dependency to camus-wmf9 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/468381
[18:58:41] ya i just bumped it again
[18:58:46] rather than deleting the old one in tags, etc.
[18:59:02] ack!
[18:59:25] so now I'd need to deploy refinery source (next week with the other changes, if any) and then refinery
[18:59:33] and eventually re-enable the camus cron
[19:02:47] all right then, logging off for today
[19:02:51] thanks a lot ottomata!
[19:02:56] * elukey off!
[19:06:22] ottomata, I think that the jobs are good puppet-wise, however, the time_measures are not being loaded! aaaargh! they are passed correctly and I tested EventLoggingToDruid many timessss! damn, there's probably a bug in the scala code... :[ looking
[19:09:24] laters elukey!
[19:09:33] oooo yuci
[19:13:55] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10MMiller_WMF) a:03nettrom_WMF
[19:14:24] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10MMiller_WMF) @nettrom_WMF is going to lay out the next steps for this, and tag engineers as app...
[19:26:54] joal yt? :] qq: if we store monthly segments in Druid, and we set a deletion rule of P3M, what will happen? will it delete the whole month segment?
[19:27:18] hm - interesting q mforns
[19:27:22] I don't know !
[19:27:37] hm!
[19:27:44] mforns_: I assume it'll delete the segment once the full month is done, but I never tested it
[19:27:53] I see..
[19:27:56] looking
[19:28:11] which means that you actually get almost 4 months of data just before the new month
[19:28:16] But not sure
[19:28:26] joal, do we compact pageviews_hourly in monthly segments? yes, right?
[19:29:08] mforns_: we don't
[19:29:12] oh
[19:29:30] mforns_: we have month-segments for pageview_daily and day-segments for pageview_hourly
[19:29:40] I see
[19:29:53] pageview_daily is not deleted though
[19:30:08] mforns_: for the segment-size definition my suggestion depends on segment size
[19:30:40] If data is big, a small time-portion is better, if data is small, a longer portion is better
[19:30:53] aha
[19:30:54] mforns_: druid likes segments of ~300/700Mb
[19:31:09] joal, I assume EL data will always be small...
[19:31:23] So, you play with segment-time and number-of-segments per time-period
[19:31:43] high throughput schemas have <1000 events per second, which are then aggregated before ingestion
[19:31:52] For instance for pageview_hourly, we have day-segments, each having 8 subsegments
[19:32:01] aha
[19:32:38] subsegments are the shards right?
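
What the P3M question above looks like as coordinator rules (a sketch). Rules apply to whole segments, so with MONTH segmentGranularity a month only becomes droppable once its entire interval ages out of the period (joal's "almost 4 months just before the new month" intuition), while DAY segments age out day by day:

```bash
# Coordinator rule set: keep 3 months loaded, drop everything older (sketch)
cat <<'EOF'
[
  { "type": "loadByPeriod", "period": "P3M",
    "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]
EOF
```
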
[19:32:41] Actually looking at that, we could go for 4 subsegments per hour :)
[19:32:43] but anyway
[19:32:50] yes sir
[19:33:10] what I call subsegments are indeed named shards in coord
[19:33:21] Thanks for the correct naming (I had forgotten)
[19:33:36] was just asking, cause I wasn't sure
[19:33:45] mforns_: easiest for the segment-size def is to load some, look at the data size, and make a decision
[19:33:52] aha
[19:33:55] ok, makes sense
[19:34:22] Particularly because of columnar compression, aggregation etc, we have no clue how big a segment will be
[19:34:31] Or I don't :)
[19:35:11] but joal, then, a monthly segment should ideally have more shards than an hourly segment for the same data no?
[19:35:40] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Neil_P._Quinn_WMF) @nettrom_WMF just nudged me to think about whether there are other field nam...
[19:35:40] maybe not proportionally more, but more no?
[19:36:19] mforns_: the number of shards of a segment is defined at indexation, manually
[19:36:33] yea, by the num_shards parameter
[19:36:51] but imagine we have a datasource that we ingest in daily segment granularity
[19:36:59] with 4 shards per day
[19:37:11] but theoretically, if you have the same data stored either in hour-segs or month-segs, you should provide a lot more shards for month-segs
[19:37:23] and then we want to compact it to monthly segment granularity, how many shards would we use there?
[19:37:31] exactly
[19:37:44] exactly I can't say, but probably a lot more!
[19:37:50] xDD
[19:38:15] no, I wasn't asking for how many exactly! hehehe, just saying that I agreed with you :D
[19:38:19] 30*4 - compaction/aggregation factor, if 4 is the correct number of shards for day-segs
[19:38:31] makes total sense
[19:38:37] More or less :-P
[19:38:43] :)
[19:39:14] do you know if the number of shards needs to be a power of 2?
[19:39:36] No need - I just kinda like it :D
[19:39:40] ok
[19:40:21] mforns_: IMO the time-segments are interesting for managing segment-size and being better at loading rules (you don't need to wait a full month before dropping some data)
[19:40:34] However, this is mostly true for big datasources
[19:40:58] For small datasources, well the cost of having a month present while not being used is not that important, because it's small
[19:41:08] I see
[19:41:44] do you know what the compression factor is between daily and monthly?
[19:41:50] the factor gain
[19:42:24] mforns_: difficult to know actually cause it depends a lot on the data
[19:42:33] aha
[19:43:34] mforns_: But, I assume that it'll not be more than 0.8
[19:43:45] actually, not less than 0.8
[19:43:49] aha
[19:43:51] Which is already not bad
[19:43:54] yea
[19:44:04] But this is completely finger-in-the-air :)
[19:44:25] I'm thinking about whether the monthly compaction pays off in the case of EL datasources, because then the deletion would happen monthly...
[19:44:28] * joal thinks: you asked me for a number, I'll give you a number :)
[19:44:31] with monthly blocks
[19:44:34] hehehe
[19:44:51] I'm thinking maybe it's better to remove the monthly job...
[19:44:58] mforns_: I'm more worried about segment number than anything else
[19:45:01] then the P3M rule will be able to delete day by day
[19:45:04] aha
[19:45:16] that's another interesting thing..
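
As joal says, the shard count is fixed manually at indexation time; in a hadoop indexing task it lives in the tuningConfig (excerpt is a sketch, targeting the ~300-700MB-per-shard guideline above):

```bash
cat <<'EOF'
"tuningConfig": {
  "type": "hadoop",
  "partitionsSpec": { "type": "hashed", "numShards": 4 }
}
EOF
```
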
[19:45:29] right now puppet assigns a constant number of shards to each job
[19:45:34] If a day of EL data is ~300mb, no problem, we keep daily segments with 1 shard
[19:45:51] hourly = 2, daily = 4, monthly = 8
[19:46:02] If a day of data is, let's say, 10Mb, well it's worth compacting it monthly, and reducing the pressure on coord
[19:46:19] mforns_: ???
[19:46:35] joal, 1 hour of navigationtiming is <1MB
[19:47:11] 1 hour of readingdepth is 12MB
[19:47:28] 1 hour of pageissues is 10MB
[19:47:28] mforns_: which calendar hour is that?
[19:47:34] last hour
[19:47:37] oh no
[19:47:42] hour 17 CEST
[19:47:42] it's 6 hours ago
[19:47:43] ?
[19:47:46] ok
[19:48:01] I'm asking cause data volume changes :)
[19:48:05] ok
[19:49:02] 10Analytics: Better redirect handling for pageview API - https://phabricator.wikimedia.org/T121912 (10Milimetric) I think the full redirect graph accounting for all redirects becomes possible after we ingest and process wikitext content. It's feasible for the 2018/2019 year, and it's something a bunch of us wan...
[19:49:22] We were roughly at peak for that hour
[19:49:22] joal, looks like these 2 reading schemas, which are quite big for EL, could have just 1 shard per day
[19:49:31] I agree
[19:49:48] To be safe and to facilitate parallelization, let's make them 2
[19:49:54] aha
[19:50:16] Actually no - my bad
[19:50:19] ok, it seems the hourly num_shards could be hardcoded to 1
[19:50:33] 12*24 = 288 --> One shard is good
[19:50:53] hourly shard is 1 for sure, even daily is 1 for those sizes
[19:51:10] But I'll have no problem keeping daily shards of that size
[19:51:37] The ones I wanted to remove, for instance, were the unique-devices ones - the daily shard was a few mb - No point keeping them
[19:52:26] joal, aha
[19:52:44] readingdepth is the biggest EL schema right now in terms of throughput
[19:52:58] maybe we can hardcode the daily shard num to 2 and
[19:53:09] in case other big schemas come
[19:53:17] actually, default to 2
[19:53:37] hm...
[19:53:55] ok, will modify jobs
[19:54:51] mforns_: just as an example, pageview_hourly datasource shards are ~201mb
[19:55:05] aha
[19:55:35] joal, one last question :] druid reduce_memory and spark driver_memory should be similar right?
[19:55:47] nope
[19:55:54] not the same function :)
[19:55:56] oh
[19:56:25] the spark driver would be the equivalent of the map-reduce ApplicationMaster in Yarn
[19:56:34] aha
[19:56:59] And a reduce would be an executor in spark (spark doesn't distinguish between maps and reduces, as it runs all of them indifferently in workers)
[19:57:20] I see
[19:58:48] joal, so driver_memory can be smaller than druid's reduce_memory
[19:59:20] and druid's reduce_memory should be proportional to the data size
[19:59:33] mforns_: they are quite different - I'm not sure I understand what spark has to do here :)
[19:59:41] Ah maybe I get it
[19:59:52] spark will launch indexations, right?
[20:00:12] when you call spark-submit, you pass e.g. --driver-memory 8G
[20:00:50] and when you launch an indexation task, you pass e.g. reduce_memory=8192
[20:01:08] spark driver memory depends a lot on the task
[20:01:12] (same as reduce)
[20:01:41] here your spark task will look at refined folders and launch druid indexation through an API call, right?
[20:01:48] yes
[20:02:04] mforns_: this is very small - driver memory can probably be 2G
[20:02:11] ok I see
[20:02:22] regardless of data size
[20:02:33] As for the reduce memory, this will be the memory of the jobs indexing the data
[20:03:02] joal, but prior to calling the API, the spark job does some transformations to the data
[20:03:03] Given the data is not huge, making it 4g is probably sufficient
[20:03:15] aha, makes sense
[20:03:25] mforns_: transformation is done in the executors, not the driver
[20:03:42] sure, as you said even reduce... ok
[20:03:46] so it should be parallelized
[20:04:02] k, coooool!
[20:04:11] thanks a lot for all the illuminating answers
[20:04:31] If you do: Spark (driver 2G, exec 4G, max-exec 128) and MapRed(2G, 4G) I hope you should be fine :)
[20:04:49] np mforns_ :)
[20:04:52] max-exec 128? I thought that was a lot
[20:05:12] mforns_: it's a lot, but what's the purpose of not using resources if they're available?
[20:05:21] right
[20:05:23] ok!
[20:05:48] mforns_: The cluster has ~3TB ram, so 1/2Tb is ok ;)
[20:08:38] :P
[20:32:02] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10JAllemandou) More investigations on this. As @Nuria explained above, we think that the hi...
[20:32:07] nuria: --^
[20:34:39] joal: looking
[20:35:44] 10Analytics, 10Analytics-Kanban, 10Contributors-Analysis, 10Product-Analytics, 10Patch-For-Review: Decommission edit analysis dashboard - https://phabricator.wikimedia.org/T199340 (10Nemo_bis)
[20:38:18] mforns_brb: while looking at crons on an-coord1001, I wonder about the necessity for hourly data for the various EL-to-druid datasources
[20:38:49] I have no real clue on usage, so I can't say
[20:42:09] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) Some more context, using our data we can see that you are not sending as many beac...
[20:42:26] 10Analytics, 10Analytics-Kanban, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria) a:05JAllemandou>03AndyRussG
[20:42:32] 10Analytics, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Nuria)
[20:43:18] Thanks for comments nuria :)
[20:45:43] joal: one less thing to worry about for now
[20:46:10] nuria: yes and no - we have a pending decision on whether to use EL or
[20:46:22] webrequest for realtime banners
[20:46:38] so yes by some means, no by some others nuria :)
[20:48:16] PROBLEM - Throughput of EventLogging EventError events on einsteinium is CRITICAL: 96.91 ge 30 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[20:48:29] ouch
[20:48:53] joal: I think webrequest might be painting a false picture of banner impressions and counting some that are shown to users that have never actually seen them (if banners are sent to Js-enabled browsers only). If banners are sent to ALL browsers then the EL count will always be smaller.
[20:48:56] ayayaya
[20:50:16] looking at alarm.
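
joal's suggested sizing above (driver 2G, executors 4G, max 128 executors), written out as a spark-submit; the jar path is illustrative and the class name follows the '.refine' fix from earlier:

```bash
spark2-submit --master yarn \
  --driver-memory 2G --executor-memory 4G \
  --conf spark.dynamicAllocation.maxExecutors=128 \
  --class org.wikimedia.analytics.refinery.job.EventLoggingToDruid \
  /srv/deployment/analytics/refinery/artifacts/refinery-job.jar   # path illustrative
```
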
[20:52:49] nuria: error seems related to the ReadingDepth schema
[20:52:56] "schema": "ReadingDepthSchema.enable"
[20:53:51] joal: looking
[20:59:41] joal, nuria, seems readingdepth is receiving events from another schema? the events include fields app_install_ID, ts and type_of_search, that are not present in the latest revision of readingdepth
[21:00:11] mforns: and from what I found, the schema is not correctly named (ReadingDepth.enable)
[21:01:15] joal: man
[21:01:19] jdlrobson: yt?
[21:01:53] wait, no, the erroring schema is MobileWikiAppiOSSearch no?
[21:03:14] mforns: it has changed I think (https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1&from=now-3h&to=now-5m
[21:03:43] wow that's weird
[21:03:47] PROBLEM - Throughput of EventLogging EventError events on einsteinium is CRITICAL: 30.17 ge 30 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[21:04:53] mforns, nuria: may I leave this to you folks? Seems back to normal ...
[21:04:57] RECOVERY - Throughput of EventLogging EventError events on einsteinium is OK: (C)30 ge (W)20 ge 8.838 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[21:05:05] sure joal!
[21:06:32] mforns: from briefly tailing the logs, the error is:
[21:06:35] https://www.irccloud.com/pastebin/CcMTBLWL/
[21:06:57] yea MobileWikiAppiOSSearch
[21:07:29] jdlrobson: it would be worth looking at the errors with ReadingDepthSchema.enable
[21:09:57] nuria, the table MobileWikiAppiOSSearch does not exist yet in the event hive db
[21:10:08] it seems these are the first events sent for that schema
[21:10:43] maintainers are Chelsy and Josh
[21:11:06] but yea, error rate is fine again
[21:13:34] 10Analytics, 10Android-app-Bugs, 10Product-Analytics, 10Reading-analysis, 10Wikipedia-Android-App-Backlog: Many errors on ReadingDepth.enable (?) schema - https://phabricator.wikimedia.org/T207423 (10Nuria)
[21:13:39] mforns: ok, 1 ticket filed
[21:13:41] https://phabricator.wikimedia.org/T207423
[21:13:51] mforns: let's file ticket #2
[21:22:55] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10Nuria)
[21:23:38] mforns: filed ticket #2
[21:24:00] nuria, ok, appended an email to the alert
[21:26:25] joal, nuria, sorry you had to take over https://phabricator.wikimedia.org/T204396 that was assigned to me
[21:32:27] 10Analytics, 10Project-Admins: Create project for SWAP - https://phabricator.wikimedia.org/T207425 (10Neil_P._Quinn_WMF)
[21:32:42] 10Analytics, 10Project-Admins: Create project for SWAP - https://phabricator.wikimedia.org/T207425 (10Neil_P._Quinn_WMF) >I think just "SWAP" is fine for the project name, but it could be "Analytics-SWAP" to be similar to other projects owned by the Analytics team. Thoughts, @Nuria, @Ottomata?
[21:38:39] nuria: hey what's up
[21:40:22] ReadingDepthSchema.enable is not a schema so that's odd
[21:41:17] https://github.com/search?q=org%3Awikimedia+ReadingDepthSchema.enable&type=Code
[21:41:33] i'm guessing wikimedia.events. assumes foo is a schema
[21:42:25] jdlrobson: i guess, sounds like that is the source
[21:43:01] when did this start happening?
[21:43:32] because this should have happened when the a/b test started 2 weeks ago, so if it's new, something changed
[21:45:01] possibly related to https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/465208/ ?
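
jdlrobson's code search, done locally against a mediawiki checkout (a sketch):

```bash
# Find whatever emits the bogus "ReadingDepthSchema.enable" schema name
grep -rn 'ReadingDepthSchema' extensions/ --include='*.js' --include='*.php'
```
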
[21:46:22] nuria: is there a bug open i can tie some patches to?
[21:50:41] jdlrobson: yes, and conveniently assigned to you!
[21:50:58] jdlrobson: https://phabricator.wikimedia.org/T207423
[21:51:22] jdlrobson: maybe the sheer volume of errors triggered the alarms and that is how we saw it?
[21:52:48] i assume this is ubn ?
[21:52:53] 10Analytics, 10Android-app-Bugs, 10Product-Analytics, 10Reading-analysis, and 3 others: Many errors on ReadingDepth.enable (?) schema - https://phabricator.wikimedia.org/T207423 (10Jdlrobson)
[21:54:18] 10Analytics, 10Android-app-Bugs, 10Product-Analytics, 10Reading-analysis, and 3 others: Many errors on ReadingDepth.enable (?) schema - https://phabricator.wikimedia.org/T207423 (10Jdlrobson)
[21:54:34] nuria: if it is ubn it will look better if that priority comes from you :)
[21:56:28] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Pchelolo) The above patch should mitigate the problem, however, we need to also account for possible clock drift between our...
[21:57:27] PROBLEM - Throughput of EventLogging EventError events on einsteinium is CRITICAL: 95.38 ge 30 https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=13&fullscreen&orgId=1
[22:03:07] 10Analytics, 10EventBus, 10Growth-Team, 10MediaWiki-Watchlist, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Pchelolo) > Just checked, python jsonschema validates milliseconds ISO-8601s with date-time format just fine. :) Apparently...
[22:06:51] 10Analytics, 10Android-app-Bugs, 10Product-Analytics, 10Reading-analysis, and 3 others: Many errors on ReadingDepth.enable (?) schema - https://phabricator.wikimedia.org/T207423 (10Nuria) This is getting to be a huge number of errors, just 20.000 in the last minute. kafkacat -C -b kafka-jumbo1001.eqiad.w...
[22:07:02] tally of errors:
[22:07:06] https://www.irccloud.com/pastebin/mE0qQEWU/
[22:08:30] jdlrobson: updated priority
[22:09:08] jdlrobson: super thanks for the fast response
[22:20:31] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10chelsyx) Thanks @Nuria ! @NHarateh_WMF Can you fix...
[22:32:44] 10Analytics, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs, 10iOS-app-feature-Analytics, 10iOS-app-v6.1-Narwhal-On-A-Bumper-Car: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (10chelsyx)
[22:41:02] (03PS2) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352)
[22:44:08] (03CR) 10jerkins-bot: [V: 04-1] Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)
[22:57:35] 10Analytics, 10Patch-For-Review: Many client side errors on citation data, significant percentages of data lost - https://phabricator.wikimedia.org/T206083 (10bmansurov) @Miriam I've submitted a patch to limit the link text to 100 characters and page title to 200 characters. Let me know if these numbers need t...
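
Reproducing nuria's error tally (a sketch; the broker is the one in the task comment above, the topic name assumes the usual eventlogging_<Schema> convention, and .event.schema is assumed to be where the EventError capsule carries the failing schema name):

```bash
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_EventError \
  -o -20000 -e | jq -r '.event.schema' | sort | uniq -c | sort -rn
```
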
[23:15:26] (03CR) 10Nuria: "Have we tested this script works on those tables and that they are partitioned according to "snapshot" just like the others?" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/468311 (https://phabricator.wikimedia.org/T197888) (owner: 10Fdans)
[23:27:45] https://grafana.wikimedia.org/dashboard/db/reading-web-dashboard?orgId=1&panelId=15&fullscreen < nuria looks like we need sentry ;-)
[23:28:21] jdlrobson: i KNEW that before
[23:28:28] haha but now we have data
[23:28:37] 1k errors a MINUTE on wikipedia mobile alone
[23:28:40] jdlrobson: it is because I have super powers
[23:28:48] well 1-3k
[23:29:08] jdlrobson: ya, my guess is that if you add desktop you'll go 10 times that
[23:29:13] i smell a project proposal coming on ;-)
[23:29:13] cc tgr|away ^
[23:29:43] i really want to know what those issues are o_o
[23:30:28] jdlrobson: also that is the "normal state of affairs", on a bursty error storm due to a bad deploy i bet those go to 100X and then all our systems for EL will go kaput
[23:30:41] yup
[23:32:03] jdlrobson: and everyone will be really sad
[23:40:03] (03PS3) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352)
[23:46:23] mforns: I had to modify one test, please let me know if I am missing something
[23:46:31] mforns: re: memoization