[01:56:53] is there a way to reach github from the stat hosts? [01:57:17] I vaguely remember there being some kind of proxy but can't find details anywhere [02:04:01] tgr: https://wikitech.wikimedia.org/wiki/HTTP_proxy :) [02:04:13] * elukey is not happy with his jet lag [02:24:04] thanks! [02:24:27] I swear I spent five minutes searching on various combinations of 'proxy' on wikitech [08:57:33] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3945294 (10elukey) Following up on my latest comment: https://github.com/confluentinc/confluent-kafka-go/issues/109#is... [10:19:41] joal: o/ - I am almost done with https://etherpad.wikimedia.org/p/analytics-hadoop-java8 [10:19:50] adding the last commands and comments [10:56:18] anybody knows what /etc/eventlogging.d/forwarders/legacy-zmq does? [11:03:10] !log restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero) [11:03:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:05:21] so --^ caused https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=4&fullscreen&var-server=eventlog1001&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now-1m [11:05:37] we dropped ~30G of memory used [11:05:38] ahhaha [11:08:11] 10Analytics, 10User-Elukey: latest varnishkafka fails to build on Debian - https://phabricator.wikimedia.org/T186250#3945632 (10elukey) [11:08:31] 10Analytics, 10User-Elukey: latest varnishkafka fails to build on Debian - https://phabricator.wikimedia.org/T186250#3938638 (10elukey) Going to triage/work on it as part of Analytics, thanks :) [11:26:57] opened https://phabricator.wikimedia.org/T186510 to track the EL issue [11:27:08] at the moment we are good, memory usage seems stable [11:29:25] need to go out for an errand + lunch [11:29:43] I don't 
expect issues but if anything happens and the forwarder is still consuming memory: [11:29:50] sudo stop eventlogging/forwarder NAME=legacy-zmq CONFIG=/etc/eventlogging.d/forwarders/legacy-zmq [11:29:53] and then start [11:30:03] * elukey errand + lunch! [13:19:42] 10Analytics-Tech-community-metrics: Enable Discourse backend in wikimedia.biterg.io once discourse.mediawiki.org gets into production - https://phabricator.wikimedia.org/T186513#3945877 (10Aklapper) p:05Triage>03Low [13:20:19] 10Analytics-Tech-community-metrics: Enable Discourse backend in wikimedia.biterg.io once discourse.mediawiki.org gets into production - https://phabricator.wikimedia.org/T186513#3945877 (10Aklapper) [13:49:26] now eventlog1001's legacy zmq forwarder has gained 2G in this timeframe [13:49:33] and from the logs it seems to be doing nothing [13:51:58] elukey: Hi! [13:52:14] elukey: if it's really doing nothing, a 2G leak in a few hours is awfully bad !!! [13:54:27] joal: hello! [13:54:36] I am actually checking if it is still the forwarder [13:54:37] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=4&fullscreen&var-server=eventlog1001&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now-1m [13:57:12] from top I am not seeing a big difference anymore [13:57:13] mmmm [13:57:17] it might be a slow leak [13:57:27] so the config pulls data from kafka [13:57:39] and sends it to the forwarder itself on port 8600 [13:58:30] at some point in my life I'll know all the daemons of eventlogging :D [14:00:24] ahhh it might be related to navigation timing [14:06:29] so it uses the EventConsumer class to listen on 8600 and filter data [14:10:16] so I was reading it wrong, both cached and used memory grew [14:10:52] used memory seems stablish, but let's keep an eye on it, I suspect that the forwarder leaks some data very slowly [14:11:01] it was ~30G two hours ago :D [14:11:16] thanks elukey for having found that :) [14:17:26] puppet kinda suggested it (Could not
evaluate: Cannot allocate memory - fork(2)) :P [14:19:10] elukey: I have double checked again the etherpad for tomorrow morning -- looks awesome :) [14:25:04] joal: super thanks! Then we are all set :) [14:25:13] what time do you prefer to start tomorrow morning? [14:25:53] elukey: If good for you, I can be available from 9:30 [14:27:17] elukey: We could have coffee reviewing the plan, then start when you wish [14:31:07] o/ [14:38:52] joal: ok for me! [14:38:53] ottomata: o/ [14:42:10] ottomata: whenever you have time can you explain to me what the legacy forwarder does on el? [14:42:59] afaics it reads from kafka and tries to push to tcp:$eventlog1001_ip:8600 [14:43:18] that is a bit unclear to me [14:43:26] I saw the custom tcp handler in handle.py [14:43:31] that is supposed to send to zmq [14:43:46] elukey: https://phabricator.wikimedia.org/T110903#3582398 [14:43:46] :) [14:43:47] but I am not familiar enough with the code to judge what it is doing [14:44:04] eventlogging originally used zmq as the main queue [14:44:06] not kafka [14:44:19] when we switched to kafka, we kept a zmq endpoint up for subscribers to use [14:45:42] I remember, but is it running on eventlog1001? [14:46:35] ya [14:46:40] morning! [14:46:44] I can see that port 8600 is bound by the forwarder though [14:47:00] ya [14:47:30] ah i see you rq [14:47:44] it is mostly ignorance, just trying to get how it works :) [14:47:53] is the zmq queue part of the forwarder itself? [14:47:59] ah yes [14:48:01] yeah [14:48:09] if so it would make sense why that thing reached 30G today [14:48:11] it isn't really 'pushing' to port 8600 [14:48:27] the zmq 'consumer' there is like the mysql 'consumer' [14:48:32] did you see https://phabricator.wikimedia.org/T186510 ? :( [14:48:45] messages come from kafka and then get handed to the zmq 'writer', [14:48:55] which just makes the messages available to zmq subscribers on port 8600 [14:49:01] yeah now it makes sense [14:49:08] wow no! [14:49:12] interesting!
[14:49:20] wow [14:49:24] I had to restart it since it was causing a bit of a mess [14:49:24] if we can get coal off of zmq [14:49:26] we can stop it [14:49:35] i've been waiting for that for 2.5 years [14:49:43] https://gerrit.wikimedia.org/r/#/c/403560/ [14:50:08] 10Analytics, 10User-Elukey: Eventlogging's forwarder/zmq-legacy leaks memory over time - https://phabricator.wikimedia.org/T186510#3946093 (10Ottomata) @Krinkle :D how soon can we do https://gerrit.wikimedia.org/r/#/c/403560/ ? [14:53:07] ottomata: Now that we have a stable version of streaming in spark, would that be an option/. [14:53:11] ottomata: ? [14:53:15] haha [14:53:26] ottomata: :D [14:53:42] forward to zmq? or replace eventlogging processors? :) [14:54:01] ottomata: replace the special processor [14:54:16] ottomata: The other one is a bit more tricky I assume [14:54:40] ottomata: https://www.youtube.com/watch?v=fZY8jUuEzJQ [14:54:43] haha, which part do you want to do in spark? validation and ua parsing and producing to kafka specific topics? [14:54:53] schema* specific [14:55:53] that would be: consume record, look at schema and version, look up jsonschema in meta.wm.org, encapsulate (with EventCapsule), do any ua parsing, etc. validate, produce to destinations [14:56:17] ottomata: My understanding is that there is a specific processor that still uses ZMQ (https://gerrit.wikimedia.org/r/#/c/403560/) - Would spark streaming be a good tool for that? [14:56:21] btw, elukey i think they are going to make us upgrade to stretch on eventlog1002 [14:56:27] ah the zmq one [14:56:40] joal: if https://gerrit.wikimedia.org/r/#/c/403560/ gets merged, we can completely disable the zmq one [14:56:48] ottomata: for the EL validator, I have already done a version that was supposed to do the job [14:57:02] did you have something pulling jsonschemas from meta too? just curious?
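To make the leak discussed above concrete: the forwarder pattern ottomata describes (consume from Kafka, hand messages to a zmq 'writer' that serves subscribers on port 8600) can be modelled with per-subscriber buffers. This is a toy sketch in plain Python, not the real eventlogging code; the class and the high-water-mark behaviour (mirroring what zmq's send HWM does for slow subscribers) are illustrative assumptions, not a claim about where the actual leak was.

```python
from collections import deque

class ToyForwarder:
    """Toy model of a kafka->zmq forwarder: each consumed message is
    fanned out to per-subscriber buffers. With no high-water mark a
    slow subscriber makes its buffer grow without bound."""

    def __init__(self, hwm=None):
        # hwm=None mimics an unbounded buffer; an int caps it the way
        # a zmq-style send high-water mark would (oldest entries drop).
        self.subscribers = {}
        self.hwm = hwm

    def subscribe(self, name):
        self.subscribers[name] = deque(maxlen=self.hwm)

    def handle(self, message):
        # fan one consumed message out to every subscriber buffer
        for buf in self.subscribers.values():
            buf.append(message)

unbounded = ToyForwarder(hwm=None)
unbounded.subscribe("slow-client")
bounded = ToyForwarder(hwm=1000)
bounded.subscribe("slow-client")
for i in range(10_000):   # neither client ever reads a message
    unbounded.handle(i)
    bounded.handle(i)
```

With the bound, memory stays flat at the cost of dropping messages for subscribers that cannot keep up, which is the usual trade-off for a fan-out endpoint like this.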
[14:57:10] ottomata: I did [14:57:20] i think i remember looking at that [14:57:21] ottomata: Or at least I think I did :) [14:57:25] when we started doing jsonrefine [14:57:29] correct [14:57:43] ottomata: I proposed on Friday a research spike with Faidon/Moritz of 1 day to figure out if systemd can work, and then report back to them [14:57:46] wdyt? [14:57:51] joal: i think i'd rather figure out stream data platform stuff first, with a hope of getting rid of analytics specific eventlogging [14:57:58] elukey: sure! [14:58:02] i mean, it *can* work [14:58:03] ottomata: works for me :) [14:58:22] Gone to get Lino from the creche team - Back for standup [14:58:22] :) [14:58:28] laters [14:58:35] elukey: maybe newer versions will just work easily [14:58:42] but even if they don't, the old version *can* work [14:58:55] it'll just make managing all the various el processes much more annoying [14:59:06] as there will be no more 'eventlogging ctl stop' [14:59:08] yeah, we can definitely manage them one by one [14:59:09] we'll have to script it all [14:59:11] or something [14:59:15] yep [14:59:35] so worst case scenario, we script something that does the restart sequentially (or something similar) [15:02:08] ottomata: another thing that I wanted to ask - while reading Buffer exception reports for kafka confluent python, I saw magnus repeating a lot of times that something like poll(0) needs to be called once in a while.. now I am wondering - could it be that librdkafka, even without a callback set, needs to have the poll() called to free events in the "async buffer" ?
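elukey's question matches the documented confluent-kafka contract: produce() is asynchronous, and each message generates a delivery report that sits in an internal queue until the application calls poll(), whether or not a callback was registered. Below is a pure-Python toy model of that contract (an invented class, not the real client API) showing why never polling makes the backlog grow:

```python
class ToyAsyncProducer:
    """Toy model of an async producer whose delivery reports queue up
    internally until poll() serves them, mimicking the
    produce()/poll() contract of confluent-kafka / librdkafka."""

    def __init__(self, on_delivery=None):
        self._pending_reports = []
        self._on_delivery = on_delivery

    def produce(self, topic, value):
        # the send is async; a delivery report is queued either way,
        # whether or not a callback was registered
        self._pending_reports.append((topic, value, "ok"))

    def poll(self, timeout=0):
        # serve (and free) queued delivery reports, like poll(0)
        served = len(self._pending_reports)
        if self._on_delivery:
            for report in self._pending_reports:
                self._on_delivery(report)
        self._pending_reports.clear()
        return served

p = ToyAsyncProducer()
for i in range(500):
    p.produce("eventlogging", i)   # reports pile up here
backlog = len(p._pending_reports)  # grows until someone polls
freed = p.poll(0)                  # draining frees the backlog
```

This is consistent with the live hack mentioned later in the log (adding poll(0) to the confluent kafka producer in handlers.py): a cheap periodic poll keeps the internal queue from accumulating.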
[15:02:17] (03PS1) 10Addshore: Add .gitreview [analytics/wmde/WDCM-Structure-Dashboard] - 10https://gerrit.wikimedia.org/r/408276 [15:02:27] (03CR) 10Addshore: [V: 032 C: 032] Add .gitreview [analytics/wmde/WDCM-Structure-Dashboard] - 10https://gerrit.wikimedia.org/r/408276 (owner: 10Addshore) [15:02:47] it seems relatively inexpensive to test [15:06:55] 10Analytics, 10EventBus, 10Multi-Content-Revisions, 10Services (doing): Redesign revision-related event schemas for MCR - https://phabricator.wikimedia.org/T186371#3946234 (10Ottomata) > We might consider doing a backward incompatible change and rename the properties, but it's not advisable. Additional pro... [15:07:23] elukey: oh really? [15:07:25] poll(0)? [15:07:25] hm [15:07:40] i suppose, kinda sucks though if that is the case [15:08:28] 10Analytics-EventLogging, 10Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3946237 (10Ottomata) Hm, ok, thanks Tilman, will investigate this today. [15:09:14] ottomata: I had the suspicion reading https://github.com/confluentinc/confluent-kafka-go/issues/109#issuecomment-344072558 [15:10:14] hellooooo [15:13:10] elukey: I'm trying to track down what's wrong with the sqoop_mediawiki job, and I see it's defined here: https://github.com/wikimedia/puppet/blob/ac54b3b79efe23a76f3d6903b1b8a7606973e7db/modules/role/manifests/analytics_cluster/coordinator.pp#L59 [15:13:20] what machine does that mean it's running on? [15:13:29] all of the cluster? [15:13:29] analytics1003! [15:13:34] oh! [15:13:35] ok [15:14:22] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3946271 (10Ottomata) > batch.num.messages is 10000 by default, and queue.buffering.max.messages is 100000 Hm, I thoug... 
[15:23:31] ahhh joal, i remember now why I put transform function after insertion into dest hive table. [15:23:50] one of the reasons for this task was purging, and an idea was to just write a purged df out to a different location [15:24:14] but, to do that properly, it should be written with the correct merged schema [15:24:41] (sorry, not 'after insertion into dest hive table', i mean: after preparing hive table and merging schema) [15:24:54] so, i got a problem. [15:25:07] 10Analytics-Kanban: sqoop_mediawiki failed pagelinks for a few wikis - https://phabricator.wikimedia.org/T186529#3946290 (10Milimetric) [15:25:26] in order to have transformFn possibly alter schema (a use case we don't have...yet) i need to do it before prepareHiveTable (which merges schemas) [15:25:54] in order to write out a secondary location with the proper schema, I need to call transformFn after prepareHiveTable [15:25:56] hm [15:31:39] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291#3946334 (10elukey) test [15:46:37] Hi milimetric - What's wrong with sqoop_mediawiki? [15:46:59] joal: a few dbs failed 3 times on pagelinks [15:47:05] they're all smaller, so it's weird [15:47:10] I'm re-running them manually no [15:47:11] now [15:47:18] ottomata: Maybe we could split prepareHiveTable step in two? --> (prepareSchema, applyToHive)? [15:47:20] will add the flags once they're done [15:47:32] (they seem to be all running fine now) [15:47:56] milimetric: I have not seen any error in emails - How do we get that? [15:48:03] I checked the schemas, no difference for those particular wikis, cluster must've just hated them.
But this adds to my suspicion about something weird going on with smaller wikis, they're slower and more error prone sometimes [15:48:27] joal: neil needed the snapshot, so I backtracked from denormalize coordinator [15:49:11] k milimetric [15:49:42] joal: ? [15:49:47] milimetric: +1 about something bizarre on small wikis [15:50:15] i'd like to be able to do both things with these configurable functions: [15:50:15] - augment data schemas [15:50:15] - write to other locations [15:50:29] to augment the schema, we need to call functions before hivetable stuff [15:50:39] to write to other locations, we need to call functions after hive table stuff (schema merging) [15:50:57] so, we could: [15:50:57] - have two different transform function flags, one for before hive merge and one for after [15:50:59] ottomata: I don't get the second part [15:51:01] or [15:51:18] joal: the purging idea was to write data to 2 locations during refine [15:51:20] a public and private one [15:51:33] with the public one being purged [15:51:38] at refine time [15:51:39] !log live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291 [15:51:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:51:41] T185291: Verify duplicate entry warnings logged by the m4 mysql consumer - https://phabricator.wikimedia.org/T185291 [15:52:17] ottomata: I get that - I however don't understand why it would mean applying tFn after hive schema merging [15:53:00] if we want the secondary location to have the same schema [15:53:13] we need to write out the df with the same schema [15:53:20] and we don't know what the final schema will be [15:53:23] until we merge with the hive table [15:54:33] ottomata: This comes back to my original point: Split hive stuff in two and insert Fn in the middle: merge-hive-schemas, tFn, Apply hive schemas and write [15:54:56] ottomata: I might be missing
something though [15:55:32] not sure how you'd split it, it is passed in by user [15:56:02] joal: say i want to: geocode, and then purge and write purged df to secondary location [15:56:10] geocode needs to happen before hive schema merging [15:56:20] and purging and write to secondary location needs to happen after [15:57:36] ottomata: I think I get the point now: by applying tFn to the dataset after purging, you prevent having to know about tFns when purging [15:59:16] joal, the idea was to have purging happen in a tFn [15:59:20] something like [15:59:54] purgedDf = purge(incomingDf) [15:59:54] purgedDf.write.parquet('other/public/location') [15:59:54] return incomingDf [16:00:36] hm [16:37:13] 10Analytics, 10Analytics-Features, 10MediaWiki-extension-requests: "Reverted edits" view for Contributions - https://phabricator.wikimedia.org/T186536#3946570 (10Suncatcher_13) [16:55:15] so weird, kowiki.pagelinks is still not going beyond 0% 0% [17:01:59] milimetric: is it an empty table? [17:04:56] milimetric: no it is not [17:05:05] no, yeah, I was just querying it too [17:05:22] maybe it's just that the shard it's on in labsdb is busy? [17:05:33] milimetric: but i do not understand why this table is needed [17:05:43] this goes into the clickstream dataset [17:05:44] milimetric: we are not doing clickstream for this one [17:05:56] milimetric: clickstream is not done for all wikis though [17:06:05] oh, right, well, the sqoop just gets all tables for all wikis configured [17:06:06] milimetric: just (cc joal) about 10 [17:06:18] milimetric: that also seems it could be improved [17:06:44] cc joal [17:10:28] ok, all sqoop finished, all _SUCCESS written, all jobs should go now [17:10:32] (by themselves) [17:10:51] Thanks a lot for that milimetric !
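The ordering constraint discussed above — schema-altering transforms (geocoding) must run before the Hive schema merge, while purge-and-write-elsewhere transforms need the merged schema and must run after — can be sketched with two hook lists. This is a hypothetical sketch in plain Python rather than the refinery's Scala; all names (`refine`, `before_merge`, `after_merge`, the toy dataframes) are invented for illustration and are not the JsonRefine API:

```python
def refine(df, before_merge, after_merge, merge_with_hive_schema):
    """Run schema-altering transforms, merge the (possibly widened)
    schema with the Hive table, then run transforms that need the
    final schema, e.g. purge-and-write-to-a-secondary-location."""
    for fn in before_merge:      # e.g. geocoding: may add columns
        df = fn(df)
    df = merge_with_hive_schema(df)
    for fn in after_merge:       # e.g. purge: sees the merged schema
        df = fn(df)
    return df

# Toy "dataframe": a list of dicts; the toy merge fills in any
# column the Hive table already has but the incoming data lacks.
def merge_with_hive_schema(df):
    return [{**row, "geocoded_data": row.get("geocoded_data")} for row in df]

def geocode(df):                 # before-merge: widens the schema
    return [{**row, "geocoded_data": "XX"} for row in df]

purged_copies = []
def purge_and_tee(df):           # after-merge: write a purged copy...
    purged_copies.append([{k: v for k, v in row.items() if k != "ip"}
                          for row in df])
    return df                    # ...and pass the original through

out = refine([{"ip": "10.0.0.1"}], [geocode], [purge_and_tee],
             merge_with_hive_schema)
```

The `purge_and_tee` hook mirrors ottomata's `purgedDf = purge(incomingDf); purgedDf.write.parquet(...); return incomingDf` pseudocode: the secondary write sees the final merged schema, while the unpurged dataframe continues to the primary table unchanged.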
[17:20:21] 10Analytics, 10MediaWiki-extension-requests: "Reverted edits" view for Contributions - https://phabricator.wikimedia.org/T186536#3946646 (10Nuria) [17:20:44] 10Analytics, 10MediaWiki-extension-requests: "Reverted edits" view for Contributions - https://phabricator.wikimedia.org/T186536#3946561 (10Nuria) cc-ing @Catrope as it seems this is a fit for their team [17:21:36] 10Analytics-Kanban, 10User-Elukey: Eventlogging's forwarder/zmq-legacy leaks memory over time - https://phabricator.wikimedia.org/T186510#3946653 (10Nuria) [17:23:47] 10Analytics-Kanban, 10Analytics-Wikistats: Wikistat Beta: expand topic explorer by default - https://phabricator.wikimedia.org/T186335#3941423 (10Nuria) a:03Nuria [17:29:19] 10Analytics, 10Analytics-Cluster: Move non-critical monthly jobs to the nice queue - https://phabricator.wikimedia.org/T186180#3936013 (10Nuria) @Tbayer : maybe you can help us identify here what is not critical ? We could schedule jobs for app sessions later in the month for example, this data does not seem... [17:30:51] 10Analytics, 10Analytics-EventLogging: Provide MediaWiki timestamps in Hive-refined EventLogging tables via UDF - https://phabricator.wikimedia.org/T186155#3946715 (10Nuria) [17:32:17] 10Analytics-EventLogging, 10Analytics-Kanban: Provide MediaWiki timestamps in Hive-refined EventLogging tables via UDF - https://phabricator.wikimedia.org/T186155#3935292 (10Nuria) a:03fdans [17:32:47] 10Analytics-EventLogging, 10Analytics-Kanban: Provide MediaWiki timestamps in Hive-refined EventLogging tables via UDF - https://phabricator.wikimedia.org/T186155#3935292 (10Nuria) Maybe @fdans can work on this one after the launch of the maps on wikistats? 
[17:34:22] 10Analytics, 10Analytics-Wikistats, 10Accessibility: Wikistats Beta: Fix accessibility/markup issues of Wikistats 2.0 - https://phabricator.wikimedia.org/T185533#3946737 (10Nuria) [17:38:14] 10Analytics-Kanban: sqoop_mediawiki failed pagelinks for a few wikis - https://phabricator.wikimedia.org/T186529#3946290 (10Milimetric) [17:39:23] 10Analytics-Kanban: Make sqoop cron job report errors if success flags are not written - https://phabricator.wikimedia.org/T186541#3946770 (10Milimetric) [17:39:42] 10Analytics: Make Wikipedia clickstream dataset available as API - https://phabricator.wikimedia.org/T185526#3918334 (10Nuria) Great idea, moving it to next year. Priority wise we need to work on bot filtering before an item like this one, but possible. [17:40:55] 10Analytics, 10Analytics-Wikistats: Confusing abbreviation on Wikistats 2.0 Alpha - https://phabricator.wikimedia.org/T184011#3946791 (10Nuria) Ping @Nemo_bis this issue will not be fixed until we have translations/localization. [17:41:37] 10Analytics-Kanban: Make sqoop python code write success flags for each table that's fully imported for all wikis - https://phabricator.wikimedia.org/T186542#3946793 (10Milimetric) [17:42:49] 10Analytics, 10Analytics-Wikistats: Confusing abbreviation on Wikistats 2.0 Alpha - https://phabricator.wikimedia.org/T184011#3946803 (10Nuria) I think 1G = 1000 millions might be best UX compromise until we have internacionalization. [17:48:47] 10Analytics, 10Pageviews-API, 10RESTBase-API, 10Services (watching): Pageviews Data : removes 1000 limit in the most viewed articles for a given project and timespan API - https://phabricator.wikimedia.org/T153081#3946813 (10Nuria) This is better served by adding "dimensions" to pageviews such you can requ... 
[17:51:38] 10Analytics: Prototype counting of requests with real time (streaming data) - https://phabricator.wikimedia.org/T159264#3946818 (10Nuria) [17:51:40] 10Analytics-Kanban: Eventlogging of the Future - https://phabricator.wikimedia.org/T185233#3946819 (10Nuria) [17:53:00] 10Analytics, 10Analytics-Cluster, 10Operations: Clean up permissions for privatedata files on stat1005 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#3946826 (10Nuria) [17:54:05] 10Analytics, 10Analytics-Cluster, 10Operations: Clean up permissions for privatedata files on stat1005 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#3946832 (10Milimetric) [18:00:46] 10Analytics-EventLogging, 10Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3946853 (10Ottomata) Ah, this was my fault. When I reran the jobs, I only started from 4 days in the past. We def need alarms on missing data. Implementing them he... [18:10:28] ottomata: We’re working on the event design for JADE, https://docs.google.com/drawings/d/1Lagl0BJWVWHNvHLy5y6RNNKvl0C1tdVrE5YniwgqFJY/edit and I have a terrible question. Is it evil to use one of the eventbus UUIDs as a permanent (non-primary) key in order to allow an event’s creator to locate asynchronously created resources? [18:11:38] awight: not sure i understand, but you mean maybe the meta.id field? [18:12:17] ottomata: exactly that sort of thing, yes. It looks like IDs are always set by the event emitter? [18:12:28] & wondering, are we free to use UUID > 1?
[18:12:49] https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/2.yaml#L28-L31 [18:12:57] hmm UUID1 is actually okay, it doesn’t == dt [18:13:02] right [18:13:07] dt is just encoded in it [18:13:15] great [18:13:42] and, hm, sure i think you are asking if it makes sense to use it to find an event later, since you don't know what your primary (auto increment?) id will be in your database? [18:13:48] and if so, sure [18:13:57] yes it makes sense now, thanks. [18:14:02] one more thing. Do you know if any MediaWiki extensions are consuming from EventBus? [18:14:03] we actually use a similar field in eventlogging mysql for deduplication, and will be doing the same in hive [18:14:12] consuming from? not that I know of [18:14:16] harr. [18:14:25] That was a pretty deep assumption in this design. [18:14:28] awight: you want to consume from kafka from mediawiki? [18:14:30] yes [18:14:33] aye [18:14:34] cool [18:14:38] hehehe [18:14:45] we actually probably need to update the kafka client MW is currently using [18:14:48] it is old and crufty [18:14:54] “cool” as in, we’re the rocket dogs to pilot this concept? [18:15:38] yes, but looking for task... [18:15:44] i need a new mw php kafka client too.. [18:16:36] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Port Kafka clients to new jumbo cluster - https://phabricator.wikimedia.org/T175461#3946948 (10Ottomata) [18:16:38] 10Analytics, 10Discovery, 10Patch-For-Review: Send Mediawiki Kafka logs to Kafka jumbo cluster with TLS encryption - https://phabricator.wikimedia.org/T126494#3946947 (10Ottomata) [18:16:40] I guess we would implement something like a changeprop endpoint, the MW extension would have an API endpoint where the events are sent? [18:16:44] awight: https://phabricator.wikimedia.org/T126494 [18:17:08] awight: why do you want to consume from kafka with mediawiki? [18:17:13] so... [18:15:52] not for consuming, but ...
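ottomata's "dt is just encoded in it" is easy to verify with the standard library: a version-1 UUID stores its creation time as a count of 100-nanosecond intervals since the Gregorian epoch (1582-10-15), so the timestamp can be recovered later from an id such as meta.id. A small stdlib-only sketch; the helper name is invented:

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count 100-ns intervals from this epoch.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def uuid1_datetime(u: uuid.UUID) -> datetime:
    """Recover the creation time encoded in a version-1 UUID."""
    if u.version != 1:
        raise ValueError("timestamp is only encoded in version-1 UUIDs")
    # u.time is the 60-bit count of 100-ns intervals since the epoch
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

event_id = uuid.uuid1()          # e.g. what a meta.id could look like
created_at = uuid1_datetime(event_id)
```

So using the id as a permanent lookup key does not lose the `dt` information; the event time stays recoverable from the id itself.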
[18:17:56] We want the JADE API (Python) to accept actions, some of which need to be replicated into the MediaWiki database. I had been imagining that EventBus would be an ideal mechanism to do that. [18:18:25] 10Analytics-Kanban, 10Patch-For-Review: Remove AppInstallIId from EventLogging purging white-list - https://phabricator.wikimedia.org/T178174#3946952 (10Nuria) Of interest: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data [18:19:12] ah ok, so you want mediawiki extension to consume from kafka and write to db? [18:19:28] where/how is that mediawiki consumer going to run? [18:20:25] I’m thinking, the extension provides a MW API endpoint and something like changeprop reads our event topic, then writes to this endpoint with na at-least-once guarantee. [18:20:28] *an [18:20:58] in ^ changeprop is the kafka consumer then, no? [18:20:59] not mw? [18:22:00] yes that would work. [18:22:20] haha, i'm not saying TO do that [18:22:26] just trying to understand [18:22:56] (btw, when you say 'eventbus' here, it's a little confusing, generally we refer to the eventbus as the way to get events into kafka, as there is no consumer side eventbus tech. just kafka :) ) [18:23:01] but, anyway [18:23:37] milimetric: I'm gonna kill the clickstream job - looks like the cluster behaves in a non-expected way [18:24:19] joal, that worked! [18:24:31] ottomata: I just saw that - it unlocked everything [18:25:01] ottomata: +1 by “eventbus” I just mean the Kafka cluster we use to pass eventbus messages around. [18:25:03] ok - now I need to figure out why spark was reserving containers without starting them [18:25:36] ok cool [18:25:54] awight: what do you want to write to the mw db? [18:26:04] judgements? [18:26:17] because creating judgement is done by mw? [18:26:32] judgment...( huh, no e...ok!) [18:26:40] lol it can be spelled either way [18:26:44] oh [18:26:50] weird [18:27:31] Yeah exactly.
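The deduplication use of that id field mentioned above (in the eventlogging mysql consumer, with the same plan for Hive) reduces to keeping the first event seen per id, so redelivered Kafka messages don't become duplicate rows downstream. A hedged pure-Python sketch; the nested meta.id shape follows the schema discussed earlier, but the function itself is invented for illustration:

```python
def dedupe_by_id(events):
    """Keep only the first occurrence of each event id, so
    re-consumed (at-least-once) Kafka messages don't turn into
    duplicate rows downstream."""
    seen = set()
    out = []
    for ev in events:
        ev_id = ev["meta"]["id"]
        if ev_id in seen:          # already stored: skip the replay
            continue
        seen.add(ev_id)
        out.append(ev)
    return out

events = [
    {"meta": {"id": "aaa"}, "n": 1},
    {"meta": {"id": "bbb"}, "n": 2},
    {"meta": {"id": "aaa"}, "n": 1},   # redelivered duplicate
]
unique = dedupe_by_id(events)
```

In SQL the same idea is typically a unique index on the id column with inserts that ignore duplicates, which is what turns at-least-once delivery into effectively-once storage.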
JADE has its own pgsql store of judgments, the event stream is another source of truth (the most authoritative truth, fwiw), and MediaWiki maintains another copy of the data, in a new namespace. [18:27:57] https://books.google.com/ngrams/graph?content=judgment%2Cjudgement&year_start=1400&year_end=2000&corpus=18&smoothing=3&share=&direct_url=t1%3B%2Cjudgment%3B%2Cc0%3B.t1%3B%2Cjudgement%3B%2Cc0 [18:28:00] 10Analytics, 10MediaWiki-extension-requests: "Reverted edits" view for Contributions - https://phabricator.wikimedia.org/T186536#3946992 (10Suncatcher_13) [18:28:16] 10Analytics, 10MediaWiki-extension-requests: "Reverted edits" view for Contributions page - https://phabricator.wikimedia.org/T186536#3946561 (10Suncatcher_13) [18:28:30] (03PS1) 10Ottomata: Fix JsonRefine so that it respects --until flag [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408323 [18:28:35] ottomata: oops! That was British English. https://books.google.com/ngrams/graph?content=judgment%2Cjudgement&year_start=1500&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Cjudgment%3B%2Cc0%3B.t1%3B%2Cjudgement%3B%2Cc0 [18:29:22] something was wrong with the clickstream job ... It was not supposed to allocate more than 32 workers, but seemed to use more [18:29:38] awight: does nothing else have ability to write to mw db? do you need mw to do it? [18:29:40] I'm gonna restart it manually once the mediawiki-history is done [18:31:50] ottomata: We could write to the MW API directly, but IMO it’s cleaner to have that responsibility encapsulated in a MW extension. We can also provide an endpoint which we call synchronously, skipping the kafka mediation. I like the idea of decoupling but in #wikimedia-ai, halfak and I are realizing that we need the resulting MW rev_id, so might as well have a sync API. 
[18:31:56] 10Analytics, 10Code-Stewardship-Reviews, 10Operations, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319#3912816 (10greg) Feedback also at https://www.mediawiki.org/wiki/Talk:Code_stewardship_reviews/Feedback_solicitation/IRCR... [18:32:32] awight: aye, but i'm asking, do you need mw to write to mw db? [18:32:43] why not something that just consumes from kafka and writes to mysql? [18:33:15] ottomata: Are there any precedents for that? I don’t like the coupling... [18:33:23] aye [18:33:23] dunno [18:34:13] so, you plan to make a mw extension that provides an api endpoint like POST /create/judgment or whatever [18:34:28] and then, you'd have a kafka consumer somewhere that would POST to it when you need to make a judgment [18:34:45] i guess that sounds good.... [18:34:49] change-prop would probably help you there [18:34:59] since that is what it is designed to do (POST to HTTP in response to a kafka message) [18:35:11] (or, not POST per se, but make HTTP request) [18:35:48] so ya, awight sounds like if you have the MW API endpoint, AND you have a judgment event in kafka, change-prop will be your kafka consumer, and you don't need to figure out how to do that part in mw [18:35:49] OR [18:35:56] if you did want to consume from kafka with mw [18:35:58] you could. [18:35:58] but [18:36:10] you'd need to figure out where to run such a consumer as a daemon process [18:36:15] ottomata: I think either the sync or the Kafka-isolated endpoint is good for our use cases. Thanks for taking the time to go through this with me! [18:36:37] +1 I think I’ll skip the latter option :) [18:39:00] (03CR) 10Ottomata: [C: 032] Fix JsonRefine so that it respects --until flag [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408323 (owner: 10Ottomata) [18:41:00] do we have any non-production kafka cluster (i.e. accessible from labs) that feeds from eventbus?
[18:41:12] not production events no SMalyshev [18:41:19] there are test events in deployment-prep (beta) though [18:41:28] well doesn't have to be production, any events [18:41:28] ya [18:41:45] deployment-kafka-jumbo-1 and deployment-kafka-jumbo-2 should have them mirrored just like prod [18:41:47] from which wiki is it? [18:42:06] https://deployment.wikimedia.beta.wmflabs.org [18:42:13] great, thanks! [18:42:53] :) [18:45:15] (03PS1) 10Nuria: [WIP] Changing initialization of QTree to work arround precision Bug [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/408329 [18:49:06] Thanks elukey for the fast answer [19:05:53] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3947117 (10BBlack) Meta-update since this is quite stalled out now. I'll try to line up all the explanatory bits here that are affecting proces... [19:24:22] * elukey off! [20:02:29] 10Analytics, 10User-Elukey: latest varnishkafka fails to build on Debian - https://phabricator.wikimedia.org/T186250#3947314 (10Jrdnch) @Liuxinyu970226 Thanks, I wasn't aware that the #Varnish project is archived.
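The change-prop arrangement ottomata sketched earlier — a consumer reads the judgment topic and makes an HTTP request to the MW API endpoint for each message, retrying on failure, which yields at-least-once delivery — looks roughly like this. A toy in plain Python with a stubbed-out transport, not actual change-prop code; it also shows why the endpoint must be idempotent:

```python
def propagate(messages, post, max_retries=3):
    """At-least-once propagation: each consumed message is POSTed to
    the endpoint and only 'committed' once the POST succeeds;
    failures are retried, so the endpoint may see the same message
    more than once and must handle it idempotently."""
    committed = []
    for msg in messages:
        for attempt in range(max_retries):
            if post(msg):            # e.g. POST to the MW API endpoint
                committed.append(msg)
                break
        # after max_retries the offset is not committed; in a real
        # consumer the message would be re-delivered later (omitted)
    return committed

calls = []
def flaky_post(msg):
    # stub transport: the very first attempt fails, the rest succeed
    calls.append(msg)
    return len(calls) > 1

done = propagate(["judgment-1", "judgment-2"], flaky_post)
```

The retry on "judgment-1" is exactly the duplicate-delivery case: the endpoint receives the same payload twice, which is harmless only if creating the same judgment twice is a no-op (e.g. keyed on the event's meta.id).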
[20:07:53] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3947335 (10diego) [20:15:44] (03PS2) 10Mforns: Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria) [20:23:06] (03CR) 10jerkins-bot: [V: 04-1] Optimize WikiSelector for slow browsers [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria) [20:25:12] (03CR) 10Mforns: Optimize WikiSelector for slow browsers (036 comments) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria) [20:25:53] nuria_, I found a problem with the WikiSelector code that was affecting the TopicSelector (because they both use SearchResults) [20:26:04] fixed that and the patch is good to be CR'ed [20:26:31] please, see if the performance problem has been solved, or reduced [20:26:46] I left some comments on the patch [20:27:13] thx! [20:28:20] (03CR) 10Nuria: "Tested and while still not super speedy on my browser is much faster, the rendering that took almost 7 secs before is 1.5 now." [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/405398 (https://phabricator.wikimedia.org/T185334) (owner: 10Nuria) [20:29:25] Am I correct to think that EventLogging schemas do not support nested objects? [20:29:36] hm jenkins is failing.. [20:35:22] marlier: they do but they are not recommended [20:35:32] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines [20:37:10] Cool, good enough for me. [20:37:11] Thanks [20:45:12] 10Analytics-Cluster, 10Analytics-Kanban, 10Operations, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3947432 (10Ottomata) Thanks @bblack, it's at least good to know that we'll need to do the IPSec thing or this will block us for a long while. I... 
[20:54:27] (CR) Ottomata: [C: +2] Add configurable transform function to JSONRefine [analytics/refinery/source] - https://gerrit.wikimedia.org/r/405800 (https://phabricator.wikimedia.org/T185237) (owner: Ottomata)
[20:54:50] mforns: lemme know if you want to brain bounce jsonrefine purging stuff
[20:55:20] ottomata, sure, do you have time now?
[20:55:22] sure
[20:55:25] bc!
[20:55:28] k!
[22:10:02] (PS1) Ottomata: Factor out RefineTarget from JsonRefine for use with other jobs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/408435
[22:11:48] (PS2) Ottomata: Factor out RefineTarget from JsonRefine for use with other jobs [analytics/refinery/source] - https://gerrit.wikimedia.org/r/408435 (https://phabricator.wikimedia.org/T181064)
[23:02:44] it seems uBlock Origin is now blocking eventlogging requests
[23:03:49] incidentally, the wikitech docs on how to debug eventlogging problems are very outdated
[23:05:56] e.g. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging talks about an eventlogging-errors logstash channel which does not exist, https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_representations talks about /srv/eventlogging on stats100* which also does not exist, something somewhere mentioned the events.errorevent table on hive which does exist but does not seem to be working...
[23:09:09] tgr, fixed the /srv/eventlogging stat100* doc, it should be /srv/log/eventlogging
[23:09:25] thx
[23:10:07] tgr, what's wrong with the event.eventerror hive table?
[23:10:27] it seemed completely empty
[23:10:35] I might have messed up the query though
[23:10:49] hive (event)> select event.schema, event.message from eventerror where year = 2018 and month=2 and day=1 and hour=0 limit 10;
[23:10:49] OK
[23:10:49] schema  message
[23:10:49] ChangesListFilters  4 is not of type 'string'
[23:10:49] ChangesListFilters  4 is not of type 'string'
[23:10:49] MobileWikiAppFindInPage  'findText' is a required property
[23:10:49] ChangesListFilters  4 is not of type 'string'
[23:10:50] ChangesListFilters  0 is not of type 'string'
[23:10:50] ChangesListFilters  0 is not of type 'string'
[23:10:51] ChangesListFilters  0 is not of type 'string'
[23:10:57] MobileWikiAppFindInPage  'findText' is a required property
[23:10:57] ChangesListFilters  0 is not of type 'string'
[23:11:11] hive (event)> select event.schema, event.message from eventerror where year = 2018 and month=2 and day=1 and hour=0 limit 1;
[23:11:11] schema  message
[23:11:11] ChangesListFilters  4 is not of type 'string'
[23:11:43] the logstash thing, not sure.
[23:11:52] oh duh, apparently it did not fully sink in yet that it's 2018
[23:11:57] haha :)
[23:11:59] sorry about the false alarm on that
[23:12:39] having the errors in logstash would be super nice though, it's the first place where I usually look
[23:13:31] i think very few of us administering eventlogging know how to use logstash, which is one of the reasons why it isn't maintained; we never look there.
[23:13:42] that was set up long ago, don't know the current status of it
[23:14:19] the dashboard does not exist: https://logstash.wikimedia.org/app/kibana#/dashboards?_g=()
[23:14:38] there is an 'eventlogging' dashboard but that seems to be about internal errors
[23:15:23] or rather, mostly random?
[23:16:26] it contains everything with the word 'eventlogging' in it, such as random errors that happen on a wiki page called 'Eventlogging'
[23:16:53] huh oooook....
[23:17:03] not very useful
[23:17:28] is the uBlock thing known?
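The "false alarm" above came from querying a stale year in the Hive partition filter (year/month/day/hour predicates like the ones in the queries shown). A tiny helper that derives the predicate from a datetime avoids hard-coding dates; this is just a sketch of the idea, not part of any existing tooling:

```python
# Sketch: build the Hive partition filter for the event.eventerror table
# from a datetime, so a query never silently targets the wrong year.
# The partition column names (year/month/day/hour) match the queries above.
from datetime import datetime

def partition_predicate(dt):
    """Return a WHERE-clause fragment selecting a single hourly partition."""
    return "year = {0.year} and month = {0.month} and day = {0.day} and hour = {0.hour}".format(dt)

print(partition_predicate(datetime(2018, 2, 1, 0)))
# → year = 2018 and month = 2 and day = 1 and hour = 0
```

Pass `datetime.utcnow()` (or a specific hour) instead of a literal to keep ad-hoc queries pointed at recent partitions.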
[23:17:35] uBlock I don't know anything about either
[23:19:12] it's claimed to be used by 10M Chrome users
[23:19:34] so probably popular enough to be worth reaching out to
[23:19:40] I'll file a task
[23:26:52] Analytics, EventBus, MediaWiki-JobQueue, Goal, Services (doing): FY17/18 Q3 Program 8 Services Goal: Migrate two high-traffic jobs over to EventBus - https://phabricator.wikimedia.org/T183744#3947864 (Pchelolo)
[23:26:55] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3947861 (Pchelolo) Open→Resolved Seems like the migration is complete with no issues. Resolving
[23:27:19] ottomata: btw I can set up / fix things in logstash if someone can show me the format of events sent to logstash
[23:28:51] oh! ok cool tgr. they are in kafka, so all that needs to happen is for them to be consumed from kafka into logstash somehow
[23:28:59] this is the schema
[23:28:59] http://meta.wikimedia.org/w/index.php?title=Schema:EventError
[23:29:12] wrapped in the event capsule thing
[23:29:24] you can see a few if you consume them from kafka
[23:29:26] e.g.
[23:29:35] it might well be happening, maybe the only thing missing is the dashboard (which is a saved query basically)
[23:29:55] kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t eventlogging_EventError
[23:30:11] (afk)
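Per the exchange above, EventError events live in the eventlogging_EventError Kafka topic, wrapped in the event capsule, and the earlier Hive queries read event.schema and event.message from them. The snippet below parses a hand-written sample capsule; only the schema and event.message fields are grounded in the log, and the sample values are copied from the Hive output rather than real Kafka payloads:

```python
# Sketch: unpack one EventError message as it might arrive from Kafka
# (e.g. via the kafkacat command above).  The capsule here is a made-up
# sample; real capsules carry additional fields not shown.
import json

raw = json.dumps({
    "schema": "EventError",          # the capsule's own schema name
    "event": {
        "schema": "ChangesListFilters",         # schema that failed validation
        "message": "4 is not of type 'string'", # validation error text
    },
})

capsule = json.loads(raw)
failing_schema = capsule["event"]["schema"]
error_message = capsule["event"]["message"]
print(failing_schema, "-", error_message)
# → ChangesListFilters - 4 is not of type 'string'
```

A logstash (or any other) consumer would apply the same extraction to route errors onto a per-schema dashboard.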