[02:06:50] 10Quarry, 10Documentation: admin docs: quarry - https://phabricator.wikimedia.org/T206710 (10zhuyifei1999) It's part of [[https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Quarry|Data Services]] because of Yuvi I think :) And Quarry has so many users that it's almost production-like. WMF researchers /... [02:38:16] 10Analytics, 10New-Readers: Instrument the landing page - https://phabricator.wikimedia.org/T202592 (10Prtksxna) [03:35:14] I'm trying to run a query from beeline [03:35:16] I got [03:35:21] Caused by: java.lang.OutOfMemoryError: Java heap space [03:35:22] Error: org.apache.thrift.TApplicationException: CloseOperation failed: out of sequence response (state=08S01,code=0) [03:35:43] I'm not sure what to do about that issue. Running out of memory is not something I thought would happen with this query [04:00:13] I adjusted my query and got it to work, but i don't really understand what the difference was that triggered the error [05:20:33] bawolff: o/ - can you show us the query? Maybe it is returning too many records? [05:20:54] also you can try to use 'hive' directly [05:21:11] Sorry, I think I overwrote the file I had it saved in [05:22:55] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (10elukey) The new Turnilo version will be deployed tomorrow Thu 25th together with Druid 0.12.3 [05:24:20] I know that's really bad for me to ask for help and then delete the problematic query [05:24:35] sorry [05:26:21] bawolff: it is completely fine, we are here to help, lemme know if you need any review later on!
I am not the best one to give advice on a query but I can surely forward the question to the most knowledgeable ones :) [05:27:30] As an additional question, is there a more sane way to save the results of my query to a file other than piping redirection (Which seems to get a bunch of CR characters randomly thrown in, and some other stuff from the prompt) [05:38:57] anybody here that knows what's going on with kafka? [05:39:18] it looks like timestamp format has changed and wdqs updater can not read it anymore [05:39:35] bawolff: check INSERT OVERWRITE LOCAL DIRECTORY for hive, it should do the trick [05:39:54] SMalyshev: Kafka main? [05:39:57] thanks [05:40:11] elukey: probably, I don't know which one is main [05:40:27] kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092,kafka1003.eqiad.wmnet:9092 [05:40:32] that's the broker string [05:40:33] yes that one :) [05:40:53] so I am not aware of any format changes [05:41:19] can you give me more info so I can check? Like timeline (if you have) and format change? [05:41:40] well something must have happened because now Updater can not read kafka messages and several hours ago everything was fine [05:42:34] Oct 24 00:28:30 wdqs1003 wdqs-updater[1614]: 00:28:30.876 [main] WARN o.w.q.r.tool.change.JsonDeserializer - Data in topic eqiad.mediawiki.revision- [05:42:41] I am checking https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?from=now-24h&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=wdqs1004 [05:42:46] so revision create right?
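For bawolff's stray-CR question above, elukey's INSERT OVERWRITE LOCAL DIRECTORY suggestion is the robust path; as a stopgap, output already captured via shell redirection can also be cleaned after the fact. A minimal sketch (the prompt strings and sample contents here are assumptions, not taken from the channel):

```python
# Hypothetical cleanup of beeline output saved via shell redirection:
# strip stray carriage returns and drop prompt-echo lines.
def clean_output(raw: str) -> str:
    lines = []
    for line in raw.splitlines():
        line = line.replace("\r", "")          # CR chars from the terminal
        if line.startswith(("0: jdbc:", "beeline>")):
            continue                           # drop prompt echo lines
        lines.append(line)
    return "\n".join(lines)

cleaned = clean_output("row1\tvalue\r\n0: jdbc:hive2://...> \nrow2\tvalue\r\n")
```

This only patches up the symptoms; writing results server-side avoids the terminal noise entirely.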
[05:42:54] .q.r.tool.change.JsonDeserializer - Data in topic eqiad.mediawiki.revision-create cannot be deserialized [{"comment": "Stabl", "database": "mediawiki [05:43:03] looks like from 00:35 UTC [05:43:09] oops can't copy it properly [05:43:14] Oct 24 00:28:30 is the first error message [05:43:26] ack that matches my graph more or less [05:43:29] Cannot deserialize value of type `java.time.Instant` from String "2018-10-24T00:28:24.162300+00:00" [05:43:47] so something happened to timestamp format [05:43:59] because before that it was fine [05:44:42] SMalyshev: qiad.mediawiki.revision- │ [05:44:47] uff sorry [05:44:51] https://tools.wmflabs.org/sal/production [05:45:00] so at 00:28 there was a mw deployment [05:45:43] aha, so who should be asked about it? [05:46:30] I think Mukunda, he did the deployment, but he is probably not online.. I am wondering if we can pin down a change [05:47:05] elukey: do you know whether dt in kafka comes from mediawiki? [05:47:29] I was about to ask, I don't know off the top of my head and caffeine level is still low :D [05:47:35] but I can try to figure it out [05:47:49] Andrew is the master in these things, he'd have answered in a sec :D [05:48:13] in theory, it should be an eventbus event that gets sent by mediawiki [05:50:16] SMalyshev: do you know if all events have the wrong timestamp or only a few? [05:50:28] yesterday only group 0 was deployed, so test wikipedias [05:50:28] In some of my queries where the result might contain newline, the output of beeline seems a bit messed up (The word NULL everywhere, mixed with ^A, ^B, ^C, etc). any idea what's up with that? [05:50:37] elukey: I don't know but probably all of them... [05:51:10] I'll check in a minute, I want to get updater back to life first [05:54:42] so I've done this on stat1004 [05:54:43] kafkacat -C -b kafka1001.eqiad.wmnet:9092 -t eqiad.mediawiki.revision-create [05:55:18] yeah so what do you get?
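The deserialization failure above is a precision mismatch: the new "dt" values gained fractional seconds, which the updater's strict `java.time.Instant` parsing rejected. A small Python illustration of a parser tolerant of both shapes (the two strings are taken from the log; `parse_dt` is a stand-in, not the actual wdqs-updater code, which is Java):

```python
from datetime import datetime

# OLD is the pre-deploy "dt" shape, NEW the post-deploy one with
# microsecond precision that broke the strict Java-side parser.
OLD = "2018-10-24T05:54:24+00:00"
NEW = "2018-10-24T00:28:24.162300+00:00"

def parse_dt(s: str) -> datetime:
    # fromisoformat accepts both second and microsecond precision,
    # so a consumer written this way survives the producer change.
    return datetime.fromisoformat(s)
```

The general lesson is Postel's law on the consumer side: accept every ISO-8601 precision the producer might legitimately emit.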
[05:55:23] "dt": "2018-10-24T05:54:24+00:00" [05:55:36] that's the good one I think [05:55:40] for commons [05:56:02] so maybe the events that are causing the failures are a few [05:57:27] elukey: ok, let me see what is that message [06:36:09] (we are following up in #operations) [07:13:09] ok back into a good shape [07:14:44] elukey: thanks for your help! [07:20:29] np!! [07:24:05] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Joe) I think in general it's ok to go with the nodejs rewrite - I only hope we've evaluated carefully that this service will... [07:30:49] to add more fun, eventlog1002 still alarms for logs [07:30:56] /dev/mapper/eventlog1002--vg-data 870G 778G 48G 95% /srv [07:31:13] so I think it is the time to reduce the retention to 7 days [07:56:56] /dev/mapper/eventlog1002--vg-data 870G 601G 226G 73% /srv [07:56:58] better now [08:01:08] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Upgrade to Druid 0.12.3 - https://phabricator.wikimedia.org/T206839 (10elukey) [08:01:10] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: eventlogging logs taking a huge amount of space on eventlog1002 and stat1005 - https://phabricator.wikimedia.org/T206542 (10elukey) [08:01:13] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: turnilo x axis improperly labeled - https://phabricator.wikimedia.org/T197276 (10elukey) [08:14:15] * elukey coffee [09:03:58] (03PS1) 10Amire80: Add scheduling for Content Translation MT engine data [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/469390 [09:05:01] (03PS2) 10Amire80: Add scheduling for Content Translation MT engine data [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/469390 (https://phabricator.wikimedia.org/T207765) [09:57:58] joal: (whenever you have time) - I have duplicated the banner_impression dir in refinery as banner_impression_el just to
familiarize with the oozie config files [09:58:25] my idea would be to start with the "batch" procedure before adding the Kafka indexing service [09:58:32] job [09:59:10] basically generating the data from event.centralnoticeimpression, rather than webrequest [09:59:16] (via hive) [09:59:36] and then load daily/monthly as we do now to Druid [10:03:33] but now that I think about it, the banner impression workflow could in theory also do the eventlogging things [10:03:41] so we'd have only one coordinator doing both [10:03:46] not sure though what's best [10:07:55] (I'd prefer to keep the coordinators as separate as possible to avoid interference, but it duplicates code of course) [10:08:02] Hallo [10:13:01] hi :) [10:15:40] Remind me please: If I create a new Schema in Meta, and then call mw.track, is this enough to get events logged? Or do I need to whitelist it anywhere? [10:16:39] No idea, I'll ask the other team members when they're online! [10:17:26] aharoni: it is eventlogging right? Also, with "events logged", do you mean to mysql or hdfs? [10:17:45] elukey: yes, EventLogging, and probably mysql [10:18:42] ok so if you need mysql, we'll need to explicitly whitelist the schema since now we only import to HDFS by default [10:19:00] and we whitelist to mysql only if the schema has a low rate of events [10:19:17] (there are some scalability issues with high rate events and mysql insertion) [10:27:15] elukey: What is a "low rate of events"? I think that this one will have maybe several hundreds per day. [10:30:20] that is fine, we have some schemas now that emit hundreds of events per second [10:30:25] those are the problematic ones :) [10:30:54] our general direction is to move to HDFS only eventually, so it would be great to add mysql only if really needed [10:31:38] joal: so I am attempting a refactoring to make banner impression generic, I'll probably fail miserably but I'll try to send a code review later on :P [10:35:14] elukey: what's hdfs?
hive and all that? [10:38:51] aharoni: yes exactly [10:39:13] storage wise we can scale HDFS/Hadoop horizontally, not mysql [10:39:29] so having all our data on HDFS makes sense for us in the longer term [10:39:33] elukey: OK, and that one doesn't need whitelisting? Do I need to specify anywhere that it will logged to HDFS? [10:39:44] will ^be^ logged [10:40:03] as far as I know it will be automagically imported in hive [10:40:11] but I need to verify it [10:40:12] :) [10:40:20] I am sure about the mysql whitelist though [10:50:20] elukey: sorry, disconnected [10:50:37] elukey: so if the Schema page exists, where do I see it? [10:50:39] on stat1005? [10:50:55] is the table autocreated? [10:52:52] aharoni: as far as I know it gets imported to the hive's event database, but as said above I'd need to verify with my team [10:53:07] do you have a specific schema already taking events? [10:58:43] elukey: for example Schema:UniversalLanguageSelector [11:00:16] elukey: the new one I'm creating is Schema:ContentTranslationAbuseFilter [11:00:48] The extension code that logs to it is supposed to be deployed to production later today. [11:04:35] hive (event)> select count(*) from universallanguageselector where year=2018 and month=10 and day=23; [11:04:42] 93 [11:04:56] so yep it seems to be working :) [11:05:15] I did the following (from either stat1004 or stat1005) [11:05:27] hive -> use event -> query [11:05:42] you can also check with "describe universallanguageselector" [11:05:47] the various fields [11:06:08] now that I think about it, this might be the raw events db [11:06:51] lemme check [11:09:07] nope it is not [11:09:24] so yes aharoni you can use the event database to check your events via hive [11:09:47] as soon as Schema:ContentTranslationAbuseFilter gets events, data will start to be populated in there too [11:23:14] elukey: hmm...
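elukey's count query above works efficiently because the event tables are partitioned by year/month/day (and hour), so filtering on those columns lets Hive prune partitions instead of scanning the whole table. A sketch of building such a partition-pruned predicate (the helper name is made up for illustration):

```python
from datetime import date

# Illustrative only: compose the year/month/day partition filter used
# when querying Hive event tables, so the engine reads one day of data.
def partition_predicate(d: date) -> str:
    return f"year={d.year} AND month={d.month} AND day={d.day}"

query = ("SELECT count(*) FROM universallanguageselector WHERE "
         + partition_predicate(date(2018, 10, 23)))
```

Leaving the partition columns out of the WHERE clause forces a full-table scan, which is the easy way to blow past memory and time limits.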
so if I do [11:23:14] use [11:23:17] sorry [11:23:23] > beeline [11:23:37] > use event; [11:23:41] > show tables; [11:23:49] then I'm supposed to see the new table? [11:24:14] "contenttranslationabusefilter" [11:25:27] I tried doing `mw.track( 'event.ContentTranslationAbuseFilter', {`, etc. [11:25:32] and I don't see the table created. [11:26:03] aharoni: you can use beeline or better directly 'hive' [11:26:20] the event's ingestion is not real time, so it takes a bit before you can see your data [11:26:27] it is partitioned by hours [11:26:33] when I run hive, I get a message that suggests using beeline [11:26:49] yeah we should really change it [11:27:00] heh [11:27:03] :) [11:27:13] so is hive better, despite this message? [11:27:47] so basically after you send the event, it gets pushed to kafka. Then eventlogging processes the raw events (validating them against the schema etc..) and then pushes them to a new topic, that is schema specific [11:27:58] (and this is not instant, it takes a bit) [11:28:17] then we have a spark job that pulls the topic and imports the data into hive [11:28:21] so you can query it [11:28:52] so how much time does it take until the event is actually logged and can be queried in hive on stat1005? [11:29:17] 10 minutes, 1 hour, 10 hours? [11:29:41] I'd consider one hour [11:31:00] also aharoni, I am checking the kafka topics [11:31:23] kafkacat -C -b kafka-jumbo1001.eqiad.wmnet -t eventlogging_ContentTranslationAbuseFilter -o beginning [11:31:28] and no events are there [11:31:48] (this is the second step when the event is validated) [11:32:14] if you want to simply tail, remove -o beginning (that will start from the first event) [11:33:50] elukey: on which server can I run this? is kafka the hostname? I never logged into it [11:34:02] I usually use stat1005 and mwmaint1002 [11:36:04] stat1005 is fine [11:36:08] or stat1004 [11:38:16] so... the code change hasn't been deployed yet.
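The pipeline elukey describes (raw event to Kafka, EventLogging validation, republish to a schema-specific topic, Spark import into Hive) can be caricatured in a few lines. The `eventlogging_<Schema>` topic naming matches the kafkacat command above; the validation logic itself is a toy stand-in, not the real EventLogging processor:

```python
import json

# Toy sketch of the validate-and-republish step: raw events arrive on a
# mixed topic; valid ones get routed to a per-schema topic, invalid ones
# to an error topic (topic names per the eventlogging_<Schema> pattern).
def route(raw_message: str, known_schemas: set) -> tuple:
    event = json.loads(raw_message)
    schema = event.get("schema")
    if schema not in known_schemas or "event" not in event:
        return "eventlogging_errors", event    # failed validation
    return "eventlogging_" + schema, event     # schema-specific topic

topic, _ = route('{"schema": "ContentTranslationAbuseFilter", "event": {}}',
                 {"ContentTranslationAbuseFilter"})
```

This also shows why "no events in the schema topic" can mean either no events were sent or they all failed validation, as happened with the Invalid-revision-ID errors later in the log.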
I'm just a bit impatient and trying to test it :) [11:38:41] I tried running mw.track in the JS console twice and I don't see anything [11:42:34] aharoni: I am seeing the events among the validation errors [11:42:35] (Invalid revision ID -1) [11:43:09] aha! [11:43:13] what revision is this? [11:43:18] the revision of the schema page? [11:44:20] so if the extension code change that includes the configuration of the schema name and revision is not deployed yet, can it be the reason for this error? [11:45:08] it might, I am not familiar with this part.. [11:52:15] OK, I guess I'll patiently wait until this evening when it's deployed :) [11:52:31] ack ;) [12:02:31] (03CR) 10Amire80: [C: 04-2] "Please don't merge or deploy yet. We need to check a couple of things first." [analytics/limn-language-data] - 10https://gerrit.wikimedia.org/r/469390 (https://phabricator.wikimedia.org/T207765) (owner: 10Amire80) [12:02:47] Hi elukey - kids day today :) About the refactoring of banners_impression oozie jobs, aiming for generic is very difficult [12:03:02] joal: yep I discovered it the hard way :) [12:03:20] it is super sad though, I got to the point in which only coordinator.xml's names differ [12:03:22] elukey: Marcel is moving in that direction with EL2Druid, but it is difficult [12:03:33] but then I got stuck [12:03:55] elukey: major difficulty is not scheduling, it is data-schema [12:03:55] because I wanted to discuss it with you (I know today is kids day :) [12:04:46] joal: would it be feasible to just duplicate coordinator.xml in two separate dirs, and then symlink the rest? [12:05:06] and of course .properties should be duplicated [12:05:53] elukey: I'd need to review in detail, but I have the feeling that some things need to change [12:06:26] hi joal . my discussion with elukey above is in the context of https://phabricator.wikimedia.org/T189475 , about which I had a meeting with you a few weeks ago. Is there any reason for me to log to MySQL? 
I'm probably fine with logging to hive. [12:07:05] elukey: For instance we'd need a dataset.xml file referencing the data availability (I think), the druid-template json file is probably not gonna be the same, and I wonder if we really need a hive step in the middle given data is in parquet format [12:07:32] aharoni: Hello :) [12:08:43] joal: I checked the data that we load now and wanted to keep the same, for example the bits that add rate, geocode, etc.. this is why I thought that data needed to be generated as well. The table produced is the same as the webrequest case, so the druid ingestion template should not change [12:08:53] if we want a different thing though we can change anything [12:08:59] aharoni: I think Hive should do the job (any analytics-oriented query you'd have in mysql will run in hive - with adaptations) [12:10:10] makes sense elukey - I don't know what's best here - We should probably ask FR, also to double check that EL is better than webrequest in regard to number validity [12:10:42] aharoni: Once data is in hive, you want to run a query and generate a report, right? [12:15:22] joal: yes. see the plan with checkboxes at the top of https://phabricator.wikimedia.org/T189475 . briefly: log events; create another table, and copy relevant data from the events table to the new table; use the new table in superset. [12:16:02] Right aharoni - The reason for which Mysql is the way to go then is for superset [12:16:49] aharoni: Hive is really not a superset friend (connection is done, but query latency makes the thing unusable) [12:16:50] joal: Aha! So should the events also be logged to MySQL? [12:17:13] aharoni: Yes - We do not push data from hive to mysql [12:17:32] So for the final events to be in mysql it's better to have the original ones in Mysql as well [12:17:44] joal: OK. In that case, I guess that I have to ask to whitelist this schema to log to MySQL? [12:17:48] How do I do this?
it is a simple puppet change [12:19:01] thanks elukey --^ I couldn't remember and was about to ask :) [12:19:53] elukey: do I have to request it somewhere? Or submit a patch? [12:20:17] aharoni: I can do it now, what is the schema? [12:20:19] or the schemas [12:20:38] elukey: https://meta.wikimedia.org/wiki/Schema:ContentTranslationAbuseFilter [12:22:44] should be https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469419/ joa [12:22:47] joal --^ [12:23:06] aharoni: going to validate this with Andrew when he'll be online, we'll try to merge asap [12:24:35] Thanks elukey :) [12:24:51] elukey: have you seen the patch I sent yesterday about hive-parquet logging? [12:24:57] elukey: thanks :) [12:25:16] would be nice to get this deployed today before the train deploys to Catalan and Hebrew Wikipedia. [12:25:22] There are a few more hours till them. [12:25:24] then. [12:25:31] joal: nope! [12:25:51] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/469256 [12:25:56] joal: did you add me as reviewer? [12:26:11] weird gerrit doesn't show it to me [12:26:25] elukey: just did that - ottomata suggested I rename the logging-properties file as a generic java-logging.properties file [12:26:36] elukey: any other comment very welcome :) [12:27:16] I am not going to get into a naming dispute with Andrew :P [12:27:25] whatever he thinks is best is ok for me :D [12:27:39] looks good! [12:28:16] Maaaaaaan - Not even a single -1 on a puppet patch !!! [12:28:19] * joal feels proud :) [12:40:01] elukey: updated the patch with generic java-logging name [12:42:46] elukey: has anything changed lately in EL ? [12:43:12] joal: can you be a bit more specific?
elukey: Since yesterday and the "big EL topic needs more partitions" thing, I have grafana open [12:44:14] elukey: And there are some weird patterns in disk IOPS for instance on jumbo [12:44:20] ah no I haven't touched anything, not sure what you guys did [12:44:41] elukey: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=18&fullscreen&orgId=1&from=now-24h&to=now&var-server=*jumbo*&var-network=eno1 [12:45:04] spikes between 5 and 7, and big jump now [12:45:39] joal: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=19&fullscreen&orgId=1&from=now-24h&to=now&var-server=*jumbo*&var-network=eno1 [12:45:45] Those seem to happen every now and then (looking at 7 days now) [12:46:26] they might be due to partition imbalance, but overall it is like 3% of disk usage, seems ok [12:46:35] elukey: ok elukey [12:46:38] thanks :) [12:46:46] I mean, nothing horribly wrong, buuut it would be great to track it down [13:10:16] * elukey afk for a bit! [13:24:30] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Ayounsi, for this ticket, shall we ask for these to be set up in the public VLAN? [13:39:22] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10ayounsi) That sounds good to me but will have @faidon doublecheck. Ideally please distribute those servers across multip... [13:51:44] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) > Was this considered? What is your evaluation in terms of how computationally intensive this process is? I'd have...
[14:04:28] ottomata: o/ [14:04:37] ok if I run puppet on stat1007? [14:09:31] sure! [14:26:24] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Pchelolo) > service-template-node uses service-runner, which is basically a prefork worker model (using nodejs cluster) (righ... [14:26:57] puppet is deploying a ton of things to stat1007 :D [14:30:40] ah ottomata did you see the weird issue with wdqs-updater that happened this morning? [14:31:45] it seems that yesterday the mw deploy for group 0 (test + mediawiki.org) changed the revision-create dt timestamp format [14:32:20] breaking that wdqs-updater (that went awol then because of too many exceptions etc..) [14:34:42] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1007 is CRITICAL: NRPE: Command check_check_hadoop_mount_readability not defined [14:35:09] lol [14:45:36] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (10elukey) a:03elukey [14:50:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (10elukey) [15:04:46] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1007 is OK: OK [15:07:08] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) Faidon asked for a diagram to help understand the data flow. Here we go! {F26768261} [15:22:02] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, and 2 others: RFC: Modern Event Platform: Stream Intake Service - https://phabricator.wikimedia.org/T201963 (10Ottomata) Oops, typo, meant to write > The above should not be CPU intensive (fixed in comment) [15:24:12] elukey: huh!
elukey: for all revision creates?!?!?! [15:24:31] from EventBus extensions??? [15:24:35] i did not see that [15:25:12] https://phabricator.wikimedia.org/T207817 :( [15:25:22] no no only a few [15:25:27] like mediawiki.org's [15:26:33] OH the millisecond change! [15:32:01] mforns: joal elukey interesting: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eventlogging_ReadingDepth&from=1540376845793&to=1540390823296 [15:32:15] looking [15:32:16] so we still see some of the weird produce behavior, but it is less pronounced [15:32:20] also interesting [15:32:37] there def seems to be a threshold where there is enough data in webrequest that camus is consuming [15:32:39] where this causes a problem [15:32:44] i betcha this is a page cache issue [15:32:55] when peak webrequest time rolls around [15:33:16] aha [15:33:23] there are enough webrequests that consuming every 10 minutes isn't enough to get all webrequests before they are evicted from page cache [15:33:30] so at the beginning of the webrequest camus run [15:33:39] the first messages it has to consume have to be read from disk [15:33:54] mforns: you may be right: more nodes would help here. [15:33:55] or [15:34:01] consuming more frequently [15:34:08] either continuously with kafka connect (or whatever) [15:34:11] or even with camus [15:34:18] i wonder if we could schedule camus every 5 minutes? [15:34:34] what is the page cache issue? [15:34:54] xD, I didn't say anything about more nodes, I think it was joseph or luca [15:35:04] that hitting the disk would cause a slowdown in kafka? [15:35:31] elukey: yes [15:35:38] which is affecting produce latency [15:35:50] this is just my theory! [15:36:11] ottomata, but why eviction? even if eviction took place, wouldn't the cache still be cold? [15:36:14] could be!
I just wanted to know the meaning of the "page cache issue" thing :) [15:37:49] not sure if this shows it or not... [15:37:49] https://grafana.wikimedia.org/dashboard/db/kafka?panelId=30&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=1540375200000&to=1540389600000 [15:38:00] it kinda lines up [15:38:14] elukey: kafka keeps as much as it can of recent messages in page cache [15:38:19] oh wow [15:38:44] understand [15:38:53] we could increase cache size? [15:39:00] ottomata: sure I know it :) [15:39:04] no, it's using all avail ram [15:39:16] elukey: ok so i'm guessing that around peak time [15:39:42] 10 minutes is enough time for webrequest messages to be pushed out of page cache by more recent stuff [15:39:57] so when camus starts back up, it has to start up from where it last left off [15:40:10] i think that during peak time 'where it last left off' is now on disk, and no longer in memory [15:40:26] I buy it [15:40:28] so when camus starts, it causes kafka brokers to do a bunch of diskio [15:40:54] more nodes (which == more RAM == more page cache) would help. [15:41:04] but consuming sooner would help too [15:41:11] before the messages are pushed out of cache [15:41:37] makes sense [15:42:25] mforns: the page cache is managed by the kernel basically, that tries to do what is best to keep things in free ram [15:42:43] aha [15:43:00] (so not really tunable except some sysctl stuff that are probably super complicated and not helpful) [15:43:08] i see [15:46:46] anyway, i'm still for increasing the partitions on the other high volume eventlogging topics too [15:46:48] objections?
it doesn't really 'solve' this problem, but it doesn't hurt [15:47:35] ok [15:48:55] what Andrew is saying, if I got it correctly, is that webrequests come from producers and they are kept as pages in memory before flushing them to disk, so consumers can pull stuff from ram directly [15:49:42] ottomata, I see that IO peaks a couple minutes before the 10 minute interval (12:10 -2 minutes, 12:20 -2 minutes, etc..) If the cache was not able to keep 10 minutes of webrequests, wouldn't the IO rocket just at the start of the 10 minutes range? Or isn't that when the camus job starts? [15:50:26] ottomata: mmm we also don't set any of the log flushing settings in kafka jumbo right? [15:50:34] the defaults are super high [15:53:34] Hello. T198176 fixed an issue related to deleting pages with many revisions. In the process, it subtly changed a behavior that might be related to analytics, as described at the very bottom of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/456035 Is that change an issue for the Analytics team? [15:53:35] T198176: Mediawiki page deletions should happen in batches of revisions - https://phabricator.wikimedia.org/T198176 [15:56:57] elukey: hm maybe? super high as in super often? [15:57:28] i think the log flushing wouldn't really affect this here...unless it is the write IO that is causing the problem... [15:58:09] elukey: i just added a disk_device template var to the kafka dashboard [15:58:14] so we don't have to look at sda [15:58:26] write IOps seems relatively consistent [15:58:45] as does iowait...except for kafka1005 [15:58:56] super high in the sense that it is basically not done at all [15:58:59] and 1002 [15:59:02] which, makes sense [15:59:05] so the kernel has to do it [15:59:19] when for example it drops pages from the cache [15:59:22] oh elukey that is probably better tho, right?
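The "super high" defaults elukey refers to are Kafka's broker flush settings: out of the box, forced flushing is effectively disabled and the kernel decides when to write dirty pages back, with durability coming from replication instead. Shown as a broker config fragment for reference (these are the upstream defaults as I understand them; worth double-checking against the running Kafka version):

```
# server.properties — upstream defaults: flushing left to the OS page cache.
# Count-based flush defaults to Long.MAX_VALUE, i.e. effectively never:
log.flush.interval.messages=9223372036854775807
# log.flush.interval.ms is unset by default, so there is no time-based
# forced flush either; fsync happens only when the kernel writes back.
```

Lowering these would trade extra write IO for tighter on-disk durability, which is exactly the trade-off discussed next.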
i mean, if we did it more often, we'd get more write IO [15:59:26] but more reliability [15:59:40] but we rely on the persistence from replication rather than disk io stuff [15:59:46] if a broker dies and doesn't get to flush its stuff [15:59:57] when it comes back up it should just re-replicate the missing logs [16:00:04] and then eventually let the kernel flush them like normal? [16:00:05] sure, I am wondering though if we could end up with a lot of things to flush to disk [16:00:27] anyhow just a thought [16:00:39] yeah [16:00:39] you know [16:00:42] i betcha https://phabricator.wikimedia.org/T207768 would solve this [16:00:51] it's really only 1002 and 1005 that are showing the spikes in iops [16:01:12] let's see :) [16:01:38] yeahhhhh we should probably do that before we do other things [16:02:02] ping joal [16:32:48] elukey: they used the hammer on the Hubble :) https://gizmodo.com/hubble-telescope-s-broken-gyroscope-seemingly-fixed-aft-1829934018 [16:33:44] ahahahha [16:35:29] (03PS1) 10Elukey: Add stat1007 to the list of refinery targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/469458 (https://phabricator.wikimedia.org/T205846) [16:36:12] (03CR) 10Elukey: [V: 032 C: 032] Add stat1007 to the list of refinery targets [analytics/refinery/scap] - 10https://gerrit.wikimedia.org/r/469458 (https://phabricator.wikimedia.org/T205846) (owner: 10Elukey) [17:00:44] (03CR) 10Milimetric: "\o/" [analytics/turnilo/deploy] - 10https://gerrit.wikimedia.org/r/469198 (https://phabricator.wikimedia.org/T197276) (owner: 10Elukey) [17:15:57] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10nettrom_WMF) @Ottomata Switching to camel_case makes sense. It results in a couple more fields...
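ottomata's page-cache theory from earlier in the log boils down to simple arithmetic: if the peak produce rate pushes data through the cache faster than the consumer's polling interval, the start of each consumer run reads from disk. A back-of-envelope sketch (all numbers are illustrative assumptions, not measured values from the jumbo cluster):

```python
# Rough model of the eviction argument: how long does a freshly produced
# message survive in page cache before newer data pushes it out?
cache_gb = 50          # assumed page cache available for Kafka log segments
ingest_mb_per_s = 120  # assumed peak webrequest produce rate
poll_interval_min = 10 # camus schedule discussed above

seconds_in_cache = cache_gb * 1024 / ingest_mb_per_s
minutes_in_cache = seconds_in_cache / 60

# If data survives fewer minutes in cache than the polling interval,
# each camus run starts by reading cold data from disk.
reads_hit_disk = minutes_in_cache < poll_interval_min
```

With these made-up numbers data lives roughly seven minutes in cache, so a 10-minute schedule hits disk at peak while a 5-minute one would not, which is why both "more RAM" and "consume sooner" fix the same symptom.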
10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Ottomata) > user_editcount doesn't follow the snake_case convention, but instead mirrors the na... [17:34:38] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform: Schema Registry - https://phabricator.wikimedia.org/T201063 (10Ottomata) [17:36:34] 10Analytics, 10EventBus, 10MediaWiki-Watchlist, 10WMF-JobQueue, and 5 others: Clear watchlist on enwiki only removes 50 items at a time - https://phabricator.wikimedia.org/T207329 (10Krinkle) Does not appear to be about a warning, error or fatal emitted from a WMF production service. [17:48:24] 10Analytics-EventLogging, 10Analytics-Kanban, 10EventBus, 10Services (watching): Modern Event Platform: Schema Registry Implementation - https://phabricator.wikimedia.org/T207869 (10Ottomata) p:05Triage>03Normal [17:54:29] bpirkle: (cc joal) is the issue that there was a count of revisions for the page that will "shrink" as revisions are incrementally deleted? On our end I am not sure but I do not think we trust mediawiki to tell us any count for deletions but rather we calculate it after the fact [17:59:01] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Nuria) one note: sending page_title can run into issues of message length being too long (spec... [18:00:03] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Nuria) We also recommend to have a whole new schema rather than using the old one and revamping...
nuria: there will be only one report (via the ArticleDeleteComplete hook) of the # of revisions deleted per deletion operation. Previously, it would tell you the # of revisions the page had at deletion. Now it will tell you the # of revisions the page has in the archive table after deletion. Those two #s can differ if a page was deleted, then only some revisions were restored, then the page is deleted again [18:03:03] If you're not using the number from the ArticleDeleteComplete hook, then this won't affect you [18:05:47] (03CR) 10Milimetric: [C: 04-1] Memoizing results of state functions (033 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [18:05:55] bpirkle: i think much of this predates our current team and we really do not read analytics from mw itself, rather we calculate them from the events that happened cc joal to confirm [18:06:49] nuria: ok, that would be good news. If it turns out this does affect you, please let me know. [18:09:07] bpirkle, nuria: I confirm we do not use mediawiki-hooks stats anywhere I know of (I can't say for wikistats 1, but being based on dumps, I assume it doesn't either) [18:09:12] also confirmed :) [18:09:32] though thanks for telling us, because it's good to know about changes like this [18:09:33] Thanks milimetric :) [18:09:38] 👍 [18:10:30] bpirkle: not sure if we can help in providing better numbers as of now - we are in the middle of precisely trying to get better at deleted-pages data [18:10:34] milimetric: --^ [18:11:51] ottomata: Shall we add partitions to VirtualPageview and CitationUsagePageLoad topics?
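The delete/restore/delete sequence bpirkle describes is easier to see as a toy model (this is not MediaWiki code; the dict and function names are made up purely to illustrate why the two hook values diverge):

```python
# Toy model of the ArticleDeleteComplete behavior change: old hook value =
# revisions the page had at deletion time; new hook value = revisions
# sitting in the archive table after the deletion completes.
page = {"live": 10, "archived": 0}

def delete(p):
    old_style = p["live"]            # what the hook used to report
    p["archived"] += p["live"]
    p["live"] = 0
    new_style = p["archived"]        # what the hook reports now
    return old_style, new_style

def restore(p, n):
    p["archived"] -= n
    p["live"] += n

delete(page)                          # first deletion: both counts agree (10)
restore(page, 4)                      # partial restore of 4 revisions
old_style, new_style = delete(page)   # second deletion: the counts diverge
```

After the partial restore, the page has 4 live revisions but the archive ends up holding all 10, so the hook's reported count changed from 4 to 10 for this sequence.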
[18:11:54] not sure either whether or not we can tell how many rows are deleted from the logging table, probably not [18:12:11] sounds like the only source was this number in the hook, and that's not captured anywhere in the db [18:12:27] (would've been great actually, with deciphering partial restores) [18:13:32] right milimetric [18:14:38] milimetric: re: https://gerrit.wikimedia.org/r/#/c/468205/4/src/router/index.js@57 [18:16:11] milimetric: the function cannot work downstream unless cache is defined as the module is exporting just the function itself, correct? [18:16:14] milimetric: [18:16:17] 10Analytics, 10Project-Admins: Create project for SWAP - https://phabricator.wikimedia.org/T207425 (10Milimetric) @Aklapper: subprojects sound great, I didn't know about them. And they would work for most of our projects (a few exceptions). @Neil_P._Quinn_WMF, would a subproject work for you in this context? [18:16:20] https://www.irccloud.com/pastebin/6JPbgUSx/ [18:16:41] nuria: no, it's fine to have state local to the module, nothing bad happens [18:16:50] I ran that code, works fine [18:17:19] in fact, that's how vuex works [18:17:52] all the state's defined in the store and we interact with it purely with functions that are imported from that module [18:17:52] milimetric: mmmm but vuex manages the scope, in this case the objects of the module are not scoped to the app [18:18:30] milimetric: i have a meeting in 10 minutes but we can talk after [18:18:48] they're scoped to the module, and the functions are part of the module, I admit I'm not 100% exactly sure how it manages memory, but it works fine and doesn't duplicate those objects when importing router/index.js [18:18:51] k [18:22:38] 10Analytics, 10New-Readers: Instrument the landing page - https://phabricator.wikimedia.org/T202592 (10Milimetric) Well, but http://es.wikipedia.org/bienvenida is not a normal wiki URL, like http://es.wikipedia.org/wiki/PageWithPageviews.
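[editor's note] The module-level cache question above (nuria vs. milimetric, re: the wikistats2 memoizing patch) comes down to this: state defined at module scope is created once at import time and shared by every importer of the module. The patch itself is JavaScript; the following is only the general pattern sketched in Python, with hypothetical names not taken from the actual patch:

```python
# Module-scoped cache: created once when the module is first imported,
# then shared by every caller that imports this module.
# Names are illustrative, not from the wikistats2 patch.
_cache = {}

def memoize_state(key, compute):
    """Return the cached result for `key`, running `compute` only on a miss."""
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```

Because module objects are themselves cached by the import system (in Python as in ES modules), repeated imports see the same `_cache` and nothing is duplicated, which matches milimetric's observation that exporting just the function still works.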
So it won't be tracked by the pageview tool, it would only be available... [18:25:26] 10Analytics, 10New-Readers: Instrument the landing page - https://phabricator.wikimedia.org/T202592 (10Nuria) Clarifying: Talked to @Prtksxna and team is actually working on a static micro-site, url pending. [18:26:15] * elukey off! [18:29:41] joal i'm actually curious about this [18:29:51] https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&panelId=30&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-cluster=kafka_jumbo&var-kafka_broker=All&from=now-24h&to=now&var-disk_device=sdb [18:29:53] ottomata: yes? [18:29:55] the spikes here are only on 1002 and 1005 [18:30:04] the brokers with the extra webrequest_text partitions [18:30:13] i'd kinda like to rebalance those first [18:30:16] before we make more changes [18:30:20] just to have a controlled experiment [18:30:33] in general i think we should add more partitions to those topics [18:30:39] but i kinda want to see if that would actually fix things first [18:31:11] Works for me ottomata :) [18:31:14] Before we do that [18:31:44] Can we check which brokers are masters for the 2 big topics left? [18:31:48] ottomata: --^ [18:32:01] I wouldn't be surprised if it were 2 and 5 :D [18:32:08] ya [18:32:09] ...
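[editor's note] For reference, the leader check joal couldn't run himself uses the standard Kafka CLI; this is only a sketch (the ZooKeeper connect string is a placeholder, and the actual WMF topic names are not shown), using flags that exist in the Kafka 1.x tooling of the period:

```
# List partition count, leader broker, and replicas per partition:
kafka-topics.sh --zookeeper <zookeeper-host>:2181/<chroot> \
    --describe --topic <topic>

# Partitions can only be increased, never decreased, and growing them
# changes the key->partition mapping for keyed producers:
kafka-topics.sh --zookeeper <zookeeper-host>:2181/<chroot> \
    --alter --topic <topic> --partitions <new-count>
```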
[18:32:40] ottomata: also, I actually don't have the right to log onto jumbo [18:32:48] Wanted to check for myself, but no luck ;) [18:33:08] VirtualPageview 1002 [18:33:10] CitationUsagePageLoad 1004 [18:33:17] 10Analytics, 10Analytics-Dashiki, 10Google-Code-in-2018, 10goodfirstbug: Add external link to tabs layout - https://phabricator.wikimedia.org/T146774 (10Milimetric) [18:35:16] interesting [18:40:54] ottomata: when looking at small and regular spikes, 1002 has biggest ones, then comes 1004, finally 1005 [18:41:36] ottomata: Looks like we actually need both rebalancing of webrequest, AND bumping topic-partitions :) [18:42:32] aye indeed [18:42:45] i just wanna do rebalance first for controlled experiment! [18:42:50] and doing so isssss not unrisky! [18:43:01] kinda want to sync up with luca and schedule it [18:43:19] we have grooming tomorrow so we can talk about that then [18:43:20] ya ok? [18:43:21] wow team - I'm very sorry I completely missed SoS today :S I had noticed the time had changed but didn't look up my calendar during the day [18:43:39] sounds good [18:44:16] I wonder if going for partition-growth first doesn't make it a control experiment as well :) [18:44:36] But since nothing died as of now, no big deal if we wait a bit more [19:02:23] kinda joal, we did it for just one schema, and it is still showing the same signs [19:02:27] but less severely [19:03:18] ottomata: no big deal :) We can rebalance first and then bump [19:03:49] ottomata: The other thing this makes me think of is that if one of our kafka-nodes fails, the volume might be problematic [19:04:26] ottomata: completely different topic (T206279) [19:04:27] T206279: Hive join fails when using a HiveServer2 client - https://phabricator.wikimedia.org/T206279 [19:04:39] ottomata: Can we bump HiveServer2 memory?
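[editor's note] If jconsole (see elukey's suggestion below at 19:16) confirms the heap really is saturating, one common way to raise HiveServer2's heap is via hive-env.sh; this is only a sketch with illustrative values, not the WMF puppet-managed config:

```sh
# hive-env.sh sketch -- values illustrative. The hive launcher script
# exports $SERVICE before sourcing hive-env.sh, so the heap can be raised
# for the HiveServer2 daemon only, leaving CLI clients at the default.
if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_HEAPSIZE=8192   # JVM heap in MB (chat suggests ~6g today)
fi
```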
[19:09:48] ottomata: from https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?panelId=4&fullscreen&orgId=1&var-server=an-coord1001&var-datasource=eqiad%20prometheus%2Fops, it seems an-coord1001 has some spare RAM :) [19:12:39] joal i suppose! [19:13:04] 10Analytics, 10Analytics-Data-Quality, 10Product-Analytics: mediawiki_history datasets have null user_text for IP edits - https://phabricator.wikimedia.org/T206883 (10JAllemandou) > But be that as it may - shouldn't the [[https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history | docum... [19:16:03] joal: (passing by) I would attach jconsole to the jvm and run Neil's query to see if we are really saturating 6g of heap memory first :) [19:16:27] elukey: +1! Let's do that [19:16:31] (sadly we have only metastore's jvm metrics - https://grafana.wikimedia.org/dashboard/db/analytics-hive?orgId=1&from=now-7d&to=now) [19:16:59] ack, it can be a quick experiment tomorrow :) [19:30:10] joal: merged your patch [19:30:10] +# Adding java-logging configuration file as hadoop-client option [19:30:10] +export HADOOP_CLIENT_OPTS=-Djava.util.logging.config.file=/etc/hive/conf.analytics-hadoop/java-logging.properties [19:30:10] + [19:30:14] does it work? [19:30:15] check on stat1005 [19:34:24] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10nettrom_WMF) @Ottomata and @Nuria : This is proposed as a revision of the current schema. Creat... [19:40:33] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Ottomata) I think you should really consider making this a new schema, as this is a backwards i...
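[editor's note] The merged patch above points `-Djava.util.logging.config.file` at a java-logging.properties file. Its actual contents aren't shown in the log; given the /tmp/hive-parquet-logs directory that turns out to be needed at 20:12, a plausible sketch using standard java.util.logging properties would be:

```properties
# Sketch only -- the real WMF file contents are unknown.
# Route chatty JUL output (e.g. parquet's) to files instead of the console.
handlers = java.util.logging.FileHandler
.level = INFO
java.util.logging.FileHandler.pattern = /tmp/hive-parquet-logs/parquet-%u.log
java.util.logging.FileHandler.level = INFO
```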
[20:08:31] hey ottomata eqi stat1005 [20:08:33] oops [20:12:09] ottomata: it worked, but needed me to create the /tmp/hive-parquet-logs folders and put write rights on it [20:12:19] oh [20:12:25] :( [20:12:27] yeah you should add that to puppet joal [20:12:30] make it create the folder [20:12:33] ottomata: could it be puppetomatized? [20:12:35] yeah [20:18:42] ottomata: just pushed a patch - probably something must be wrong :0 [20:19:01] Hey analytics engineers, could I please get a code review decision on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventLogging/+/467862 ? [20:19:21] joal, maybe just root, root [20:19:23] instead of hdfs? [20:19:28] doesn't seem to be a reason for it to be hdfs [20:20:25] Nuria has commented on it some, but now that other related patches have been merged, we either need to merge all of them (including this one) before the next train, or revert all of them [20:20:48] works for me ottomata - root it'll be [20:20:48] RoanKattouw: looking [20:21:20] done ottomata [20:22:07] RoanKattouw: ah, i thought this one was not merged due to other issues with CI, looks good. [20:22:23] One of the related patches was, and thanks to Kunal the CI issue was fixed yesterday [20:23:12] RoanKattouw: ah ok, that I think is what confused me, this one looks good, just +2 [20:35:29] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10Nuria) @nettrom_WMF FYI that there is no additional work needed on your end to create a new...
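[editor's note] The "root, root" puppet fix joal and ottomata settle on above could look something like this; a hypothetical sketch, not the actual patch (ownership per the chat, the sticky world-writable mode is an assumption so every hive client user can write its logs there):

```puppet
# Sketch of the discussed fix: create the JUL log directory so clients
# don't have to. Mode 1777 (sticky, world-writable) is assumed here.
file { '/tmp/hive-parquet-logs':
  ensure => 'directory',
  owner  => 'root',
  group  => 'root',
  mode   => '1777',
}
```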
[20:48:09] (03CR) 10Nuria: ">But then you have a point that the string depends on the metric, so it needs to be >in config (which we could localize as well)" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968) (owner: 10Fdans) [21:05:41] (03CR) 10Milimetric: "I think the argument is that "anonymous user" wouldn't apply to null strings coming on the top articles metric, for example. So then the " [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468964 (https://phabricator.wikimedia.org/T206968) (owner: 10Fdans) [23:45:43] (03PS7) 10Nuria: Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) [23:46:05] 10Analytics, 10Analytics-Data-Quality, 10Contributors-Analysis, 10Product-Analytics, 10Growth-Team (Current Sprint): Resume refinement of edit events in Data Lake - https://phabricator.wikimedia.org/T202348 (10nettrom_WMF) Thanks for the feedback, @Nuria and @Ottomata! Discussed this with the Growth Team... [23:46:06] (03CR) 10Nuria: Memoizing results of state functions (032 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [23:48:27] (03CR) 10jerkins-bot: [V: 04-1] Memoizing results of state functions [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria) [23:50:10] (03CR) 10Nuria: [C: 04-1] "-1-ing myself until i can look at tests" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/468205 (https://phabricator.wikimedia.org/T207352) (owner: 10Nuria)