[00:17:47] Analytics, EventBus, MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Patch-For-Review, Services (next): Timeouts on event delivery to EventBus - https://phabricator.wikimedia.org/T180017#3811197 (Pchelolo) So I've spent some time digging the error stack trace that we're seeing a...
[02:07:10] Analytics-Kanban: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3811381 (Nuria) >BTW, gp.wmflabs.org is permanently offline, right? We should update the documentation on Wikitech in that regard. All geowiki docs are outdated we will tackle that in a few weeks time
[03:14:12] Analytics-Kanban, Patch-For-Review, User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3811435 (Jason821) Oh I see. Could you provide an example druid yaml file to apply this jmx_exporter, which produce the JVM metrics in your demo? Than...
[08:37:45] Analytics-Kanban, Patch-For-Review, User-Elukey: Add a prometheus metric exporter to all the Druid daemons - https://phabricator.wikimedia.org/T177459#3811611 (elukey) >>! In T177459#3811435, @Jason821 wrote: > Oh I see. Could you provide an example druid yaml file to apply this jmx_exporter, which p...
[09:07:11] hellooooteam
[09:09:02] o/
[09:10:47] Quarry, User-revi: Rename Hym411 to -revi - https://phabricator.wikimedia.org/T182064#3811684 (revi)
[09:18:51] so in a bit I'd need to reboot the Hadoop master nodes 1001/1002
[09:18:58] and also druid1003
[09:39:21] so the idea is the following
[09:39:37] 1) stop all the daemons on analytics1002, reboot the host
[09:39:57] 2) once we verified that it works fine, failover an1001 to it
[09:40:19] 3) if something doesn't smell right, fallback to an1001
[09:40:36] otherwise reboot of 1001 and then failover from 1002 to 1001
[09:40:58] or, to minimize the restarts
[09:41:16] failover to 1002, reboot of 1001, reboot of 1002
[09:41:24] second one seems better
[09:42:41] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 3 others: Services Q2 2017/18 goal: Migrate a subset of jobs to multi-DC enabled event processing infrastructure. - https://phabricator.wikimedia.org/T175212#3811765 (mobrovac)
[09:42:46] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 5 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3811766 (mobrovac)
[09:42:51] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3811764 (mobrovac)
[09:45:45] (CR) Mforns: Add pageviews by country endpoint (1 comment) [analytics/aqs] - https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: Fdans)
[09:56:26] joal: o/ - do you remember what was the issue last time with the map reduce history server?
[09:56:37] I remember that we had a little issue since it was down for a bit
[09:57:17] !log suspend webrequest load bundle as extra precaution before Hadoop masters reboot
[09:57:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:57:31] hmm
[09:58:02] elukey: IIRC we were expecting jobs not to fail, and they did, because we hard-rebooted history server, or something like that
[09:59:18] there was some temporary weirdness
[09:59:27] when elukey ?
[09:59:46] when we rebooted an1001, since the history server is only on that host
[09:59:53] right
[10:00:08] but I don't remember what
[10:00:22] anyhow, green light from you?
[10:00:55] I'll wait for the last webrequest jobs to complete
[10:01:15] sure elukey
[10:01:36] elukey: there is other stuff running, but it's ok to proceed IMO
[10:03:00] !log stop camus as precautionary measure before Hadoop masters reboot
[10:03:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:44] elukey: I will again discuss the need to pause webrequest jobs if you stop camus :-P
[10:08:51] ufffff
[10:09:01] :D
[10:09:27] I'll re-enable them, it makes me feel safet :D
[10:09:30] *safer
[10:09:42] so, namenode manually failed over to 1002
[10:10:55] stopping resource manager on 1001 (so it fails over to 1002)
[10:11:10] all good
[10:11:19] and yarn doesn't work anymore as expected
[10:11:38] sounds good :)
[10:12:48] aaand 1001 is going to reboot in a min
[10:13:34] I'd need to update https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Ports
[10:13:55] so we should also get prometheus monitoring after the reboot
[10:14:09] hehe :)
[10:14:58] Analytics-Kanban: Geowiki stopped updating on October 24th - DATA LOSS (read comments) - https://phabricator.wikimedia.org/T179952#3811827 (elukey) a:elukey>Milimetric
[10:16:18] 1001 up and running
[10:20:34] everything works!
[10:20:59] all right so next step is do the same on 1002
[10:21:04] elukey: can you move master back to 1001, for UI to work?
[10:21:04] going to fail back over to 1001
[10:21:07] Yes :)
[10:21:09] Thanks :)
[10:23:39] aaaand 1001 is back the master
[10:24:24] rebooting 1002 now
[10:26:21] elukey: no failures on my end - awesome
[10:26:29] elukey: do we also have to reboot druid1003?
[10:28:26] joal: yep!
[10:30:53] I am going to verify that 1002 doesn't explode
[10:30:58] sure
[10:31:04] but in the meantime I can drain 1003
[10:31:40] elukey: if you think there's this kind of risk while rebooting, I'll make sure to hide behind you every time you go for that
[10:31:42] !log disabled druid middlemanager on druid1003 with curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
[10:31:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:34:53] joal: ahahha no no but usually when I reboot nodes once in a while hw fails badly
[10:34:59] like kafka1018
[10:35:18] all right re-enabling jobs and camus
[10:35:24] Yeah, I know, I was just joking, as usual
[10:35:32] :P
[10:35:41] morning joke makes my day feel lighter
[10:35:42] !log re-enable webrequest bundle and camus after reboots
[10:35:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:36:36] (also checked the suspended status of all the coordinators as Joseph tells me every time)
[10:36:46] :D
[10:38:01] joal: last host standing - analytics1003 and druid1003 (to reboot)
[10:38:08] ok
[10:38:15] I need to do some puppet changes for 1003 to enable prometheus
[10:38:34] elukey: Let's go for druid first, then an1003?
[10:39:32] mforns: wanna talk privacy threshold?
[10:39:42] heu fdans sure :]
[10:39:46] batcave?
[10:39:51] yep!
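Note on the 10:31:42 log entry: the middlemanager drain is a single POST to the worker's disable endpoint. A minimal Python sketch of the same request follows, assuming only the URL shown in the !log line; the helper name `build_disable_request` is hypothetical, and the request is built here but not sent.

```python
import urllib.request


def build_disable_request(host: str, port: int = 8091) -> urllib.request.Request:
    """Build (but do not send) the POST used to disable a Druid middlemanager.

    The path comes from the !log entry above; the helper name and the
    default port argument are illustrative assumptions.
    """
    url = f"http://{host}:{port}/druid/worker/v1/disable"
    # An empty body with an explicit method mirrors `curl -X POST <url>`.
    return urllib.request.Request(url, data=b"", method="POST")


req = build_disable_request("druid1003.eqiad.wmnet")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` from a host that can reach the worker.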
[10:39:55] joal: weird thing, I don't see any index_realtime_banner_activity_minutely running
[10:39:59] last one is index_realtime_banner_activity_minutely_2017-12-04T20:00:00.000Z_0_0
[10:40:16] hm
[10:40:30] Spark job runs
[10:41:29] this is bad
[10:43:00] do you see holes in pivot?
[10:43:13] Actually there is no data in pivot
[10:43:27] this comes from spark job being stuck, but I have no clue why
[10:44:58] Is it the banner activity cube right? I see datapoints, but probably checking the wrong place
[10:45:13] lemme reboot druid1003 then
[10:54:19] elukey: job back on track
[10:54:26] elukey: I have no clue why it failed :(
[10:54:38] joal: is there a way to add alarms ?
[10:54:41] actually - I have no clue why it stopped working without failing
[10:54:47] :(
[10:55:05] elukey: my plan is to add "data checks" -- let us know when data doesn't flow, so that we restart the job manually
[10:55:14] ah wow I refreshed pivot's data, now I see the hole
[10:55:58] AndyRussG: o/
[10:56:12] FYI we just discovered a big hole in realtime data for banners
[10:56:22] we just restarted the job that is responsible for pushing data
[10:57:15] it'll be covered by daily batch indexation, but it'll look bad today
[10:57:27] Analytics-Kanban, User-Elukey: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943#3812015 (elukey)
[11:01:37] elukey: I'm thinking of how to alert for data holes
[11:02:18] elukey: implementing a cron checking druid is fine, but I'm afraid of email flooding when we experience issues
[11:02:30] elukey: any suggestion?
[11:05:39] joal: if the email is hourly it is not a big deal in my opinion, at least as first option
[11:06:04] elukey: since data is to be checked for real-time, cron would go every five minutes or so
[11:06:35] yeah but we can be fine even checking hourly no?
[11:06:42] or is it mandatory to check every 5m?
[11:07:11] the other long term approach could be to add realtime monitoring + support in druid_exporter for realtime data
[11:07:16] and alert on those metrics
[11:07:17] elukey: mandatory or not is for us to decide, so we can go for hourly check
[11:07:33] we don't really have an SLA for these things, one hour should be fine
[11:08:44] ok
[11:44:14] fdans, can you try it now?
[11:44:21] I mean we
[11:44:47] I'm in mforns
[11:44:51] ok
[11:55:53] Analytics-Kanban: Productionize streaming jobs - https://phabricator.wikimedia.org/T176983#3812114 (JAllemandou) a:JAllemandou
[12:02:37] * elukey lunch!
[12:14:09] (CR) Joal: "Comments inline !" (6 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/393591 (https://phabricator.wikimedia.org/T181520) (owner: Fdans)
[12:18:33] (CR) Joal: "Even before drilling down into code, a big question: I assume we're interested in user only data, but as with top-articles, we probably ar" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/394062 (https://phabricator.wikimedia.org/T181521) (owner: Fdans)
[12:29:17] Analytics-Kanban, Patch-For-Review: Add documentation for .m suffix code to pagecounts-ez doc page - https://phabricator.wikimedia.org/T180452#3758404 (mforns)
[12:29:26] Analytics-Kanban: Create scala-spark job to ingest simple data sets from Hive-EventLogging to Druid to Pivot - https://phabricator.wikimedia.org/T179976#3812298 (mforns)
[12:34:48] Analytics-Kanban, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review the WDCM public datasets and allow them to access published datasets on stat1005 - https://phabricator.wikimedia.org/T181871#3812306 (mforns) Looking into this!
[13:07:46] joal: I'd need an hour to complete a puppet refactoring for oozie/hive, then I'll check your code review ok?
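Note on the data-hole check discussed between 10:55 and 11:08: the core of such a check is finding expected time buckets with no data. A minimal Python sketch follows, assuming the list of populated buckets has already been fetched (for example from a Druid query, which is not shown). The function name `find_holes` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def find_holes(timestamps, start, end, step):
    """Return every expected bucket in [start, end) that has no data point.

    `timestamps` holds the bucket start times that did receive data;
    how they are obtained is outside this sketch.
    """
    seen = set(timestamps)
    holes = []
    current = start
    while current < end:
        if current not in seen:
            holes.append(current)
        current += step
    return holes


# Illustrative hourly check over a four-hour window (made-up values).
window_start = datetime(2017, 12, 5, 8)
window_end = datetime(2017, 12, 5, 12)
observed = [datetime(2017, 12, 5, 8), datetime(2017, 12, 5, 11)]

holes = find_holes(observed, window_start, window_end, timedelta(hours=1))
for hole in holes:
    print("missing bucket:", hole.isoformat())
```

Run hourly from cron, a non-empty result would produce at most one email per hour, which matches the flooding concern raised above.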
[13:08:14] np elukey - will take a break soon, we'll review after
[13:17:50] taking a break -ateam
[13:35:00] Analytics-Kanban, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review the WDCM public datasets and allow them to access published datasets on stat1005 - https://phabricator.wikimedia.org/T181871#3812495 (mforns) Yes, definitely not privacy sensitive. Please, you can move the datasets to...
[13:46:41] (PS22) Mforns: Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414)
[13:51:13] (CR) jerkins-bot: [V: -1] Add core class and job to import EL hive tables to Druid [analytics/refinery/source] - https://gerrit.wikimedia.org/r/386882 (https://phabricator.wikimedia.org/T166414) (owner: Mforns)
[14:00:33] Analytics, Analytics-EventLogging, Tracking: Use draft 4 of JSON Schema specification - https://phabricator.wikimedia.org/T46809#3812575 (phuedx)
[14:00:35] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3812560 (phuedx)
[14:05:57] Analytics-Kanban, WMDE-Analytics-Engineering, User-GoranSMilovanovic: Please review the WDCM public datasets and allow them to access published datasets on stat1005 - https://phabricator.wikimedia.org/T181871#3812592 (GoranSMilovanovic) Open>Resolved @mforns Thank you very much! @Lydia_Pints...
[14:11:57] (PS1) Hashar: Add .gitreview [analytics/wikistats2] - https://gerrit.wikimedia.org/r/395536
[14:11:59] (PS1) Hashar: build: drop PhantomJS, use Chrome/Firefox [analytics/wikistats2] - https://gerrit.wikimedia.org/r/395537
[14:12:20] Analytics, Analytics-EventLogging, Tracking: Use draft 4 of JSON Schema specification - https://phabricator.wikimedia.org/T46809#3812635 (phuedx) Per T182000#3809548, the server-side JSON Schema validators support draft 3 of JSON Schema. AFAICT this is not the case for the client-side validator.
[14:12:39] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3812639 (phuedx) There are performance penalties for delivering heavyweight libraries to the client but they're far outweighed by th...
[14:15:02] o/
[14:17:36] Analytics, Analytics-EventLogging, Tracking: Use draft 4 of JSON Schema specification - https://phabricator.wikimedia.org/T46809#482238 (Ottomata) Wow, I didn't know this task existed. I have so many desires and ideas on how to make all EventLogging schemas way better. The EventLogging python codeb...
[14:18:01] (CR) Hashar: "That switch CI from PhantomJS to [Chrome, Firefox]. it is magic :)" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/395537 (owner: Hashar)
[14:18:54] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3812699 (Ottomata) Hm, wait, the client side validator doesn't already support draft 3? What version is it using? AFAIK the EventL...
[14:19:22] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3812703 (Ottomata) https://github.com/wikimedia/eventlogging/blob/master/eventlogging/schema.py#L206-L220
[14:19:57] elukey: heyaaaa https://phabricator.wikimedia.org/T182027
[14:20:06] to do that, i need to put tilman in the statistics-admins group
[14:20:14] which will give him sudo -u stats privs
[14:20:19] do you think that needs to go through regular ops process?
[14:20:24] ops meetings, etc?
[14:20:34] lemme check
[14:21:10] in theory yes
[14:21:30] but in practice we are granting privileges: ['ALL = (stats) NOPASSWD: ALL'] to only stats things
[14:22:02] does it need to be done nowish or Monday is ok?
[14:23:38] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3812717 (phuedx) @Ottomata: I //think// I've provided the correct [link to the client-side validator code](https://github.com/wikime...
[14:30:35] elukey: ? 'ALL = (stats) NOPASSWD: ALL' lets you sudo -u stats and run anything, right?
[14:30:45] sudo -u stats
[14:30:54] but elukey i don't know
[14:31:03] i'll post that on ticket then
[14:31:46] ottomata: yep exactly, what I meant is that we allow access to what the stats user can do, that is basically analytics things
[14:32:38] so it shouldn't be a huge problem to only decide between our team what to do, but the ops rules are clear, need an ops review first :)
[14:32:38] Analytics-Kanban: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3812737 (Ottomata) > it would be preferable to also being able to access the folder on stat1006 K, to do this we need to put you in the `statistics-admins` group, which gives you privileges to `sudo -u stats`. Since...
[14:32:53] Analytics-Kanban, Operations, Ops-Access-Requests: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3812738 (Ottomata)
[14:33:21] Analytics-Kanban, Operations, Ops-Access-Requests: Get access to geowiki data - https://phabricator.wikimedia.org/T182027#3812740 (Ottomata)
[14:33:38] k thanks elukey, added to ops-access-requests
[14:34:49] ottomata: np! completely unrelated, in https://gerrit.wikimedia.org/r/#/c/395527/ there is a refactoring of oozie/hive roles to profiles
[14:34:56] lemme know whenever you have time if you like it
[14:35:27] and yes I know I need to review https://gerrit.wikimedia.org/r/#/c/394438/, doing it now :P
[14:36:13] elukey: let's deploy ^ v soon now maybe together! :D
[14:37:12] modules/profile/manifests/kafka/broker.pp:178 WARNING variable not enclosed in {} (variables_not_enclosed)
[14:39:59] the CR looks good!
[14:41:00] yours too!
[14:41:45] I think that pcc is now broken since they are migrating it to puppet 4 :(
[14:42:33] aye, elukey want to first create the new cluster certificate together? in batcave?
[14:43:21] sure!
[14:52:45] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3810367 (Joe) A few things to note: - htmlCacheUpdate job frequency varies a **lot** between wikis. Even a moderately large wiki like `dewiki` can h...
[15:05:48] hi!
[15:06:31] helllllo
[15:07:20] mforns: want to resume our conversation?
[15:07:29] fdans, yes!
[15:07:31] omw
[15:08:06] mforns: batcave dos!
[15:08:10] ok ok
[15:11:30] Analytics-Kanban: Geowiki stopped updating on October 24th - DATA LOSS (read comments) - https://phabricator.wikimedia.org/T179952#3812878 (Milimetric)
[15:26:48] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Alpha release: Wikistats 2 UI feedback From Erik Z - https://phabricator.wikimedia.org/T178084#3813005 (Milimetric)
[15:26:56] Analytics, Analytics-Kanban, Analytics-Wikistats: Beta Release: Remaining UI advice from Erik - https://phabricator.wikimedia.org/T182109#3813007 (Milimetric)
[15:27:23] Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Alpha release: Wikistats 2 UI feedback From Erik Z - https://phabricator.wikimedia.org/T178084#3680164 (Milimetric)
[15:28:12] Analytics, Analytics-Wikistats: Beta Release: Remaining UI advice from Erik - https://phabricator.wikimedia.org/T182109#3813007 (Milimetric)
[15:53:03] Analytics-Kanban, Pageviews-API, Services (watching): Endpoints that 404 no longer have the "Access-Control-Allow-Origin" header - https://phabricator.wikimedia.org/T179113#3813204 (Milimetric) @Pchelolo I'm looking at this again. To recap, the CORS headers were being added by hyperswitch but this b...
[16:12:35] oh elukey you say: https://phabricator.wikimedia.org/T167304#3496938
[16:12:39] "After checking the kafka-authorizer.log file I had to add the following rules to avoid deny errors:"
[16:15:14] aahahah
[16:15:34] ah yes on the confluent topics
[16:15:45] but I didn't have our defaults
[16:16:32] Analytics, Analytics-EventLogging, Tracking: Update client-side event validator to support (at least) draft 3 of JSON Schema - https://phabricator.wikimedia.org/T182094#3813354 (Ottomata) Oh, yeah, you are right. https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/schemas/sche...
[16:17:46] well and also the __consumer_offsets
[16:17:46] topics
[16:17:46] but
[16:17:48] if you have to do it on those
[16:17:52] you'd have to do it on all topics too...
[16:18:18] not excited about making execs for those...
[16:18:23] super.users! :)
[16:20:19] only once, easily doable at bootstrap if the script is auto-generated :)
[16:20:30] did you get it working?
[16:20:55] I mean, are the __consumer_offsets topics special and need explicit ACLs ?
[16:21:27] not special
[16:21:30] when i add the ACL for the otto1 topic
[16:21:34] [2017-12-05 16:21:12,181] DEBUG Principal = User:CN=kafka_jumbo-eqiad_broker is Denied Operation = Describe from host = 10.64.32.160 on resource = Topic:otto1 (kafka.authorizer.logger)
[16:22:07] we'd have to do it for all topics? unless we can do topic=*?
[16:23:29] topic=* would work but it is weird..
[16:23:50] (I meant it is weird that error)
[16:24:19] did the past me mention this in the task?
[16:24:46] if i set super.users
[16:24:47] i get
[16:24:47] User:CN=kafka_jumbo-eqiad_broker is a super user, allowing operation without checking acls.
[16:26:55] awwwww
[16:28:38] I am not entirely opposed to super.users, just trying to figure out if the brokers need to have a basic safety net
[16:28:46] say in case there is a weird exploit or something similar
[16:29:11] because we would regret it later on if something happens
[16:29:43] yeah, but i mean, if someone can authenticate as the broker with whatever ACLs we'd need to manually set
[16:29:49] we'd regret the same things
[16:30:39] true
[16:32:07] "The default behavior is such that if a resource has no associated ACLs, then no one is allowed to access the resource, except super users. Setting broker principals as super users is a convenient way to give them the required access to perform inter-broker operations:"
[16:32:22] oh you found some docs!
[16:32:23] https://www.confluent.io/blog/apache-kafka-security-authorization-authentication-encryption/
[16:32:24] where's that?
[16:32:29] nice
[16:32:33] hurray
[16:32:34] :)
[16:32:40] so apologies, you are right, and me too paranoid
[16:38:38] elukey: https://gerrit.wikimedia.org/r/#/c/395568/
[16:40:03] +1ed
[16:48:30] elukey: with super users everything works great!
[16:48:34] \o/
[16:48:44] all the acls adding, removing etc work as expected
[16:49:00] so we are ready to think about cache::misc ?
[16:49:12] configuring vk should be really easy
[16:49:23] yeah sure! i'm going to restart mirror maker here
[16:49:45] I'll have a chat with em*a tomorrow to figure out how to roll this over
[16:51:09] k, we gotta be real careful though, camus, kafkatee, etc.
[16:51:22] any clients that use webrequest_misc we gotta think about
[16:51:35] ottomata: would it be ok from the camus consuming perspective to have say one cp host producing to kafka-jumbo and the rest on analytics?
[16:53:00] elukey: maybe, we probably need to set up a secondary camus job to consume from both clusters into the same location in hdfs
[16:53:23] elukey: we could run a test varnishkafka producing to a test topic with SSL
[16:53:24] because it would be really handy, rolling out to one and observe it for say one day
[16:53:29] ah sure
[16:53:39] that way we could be sure that the varnishkafka side works
[16:53:44] before having to mess with camus or other clients
[16:53:54] ottomata: what if we create a new vk instance to test it ?
[16:53:58] we have to think about it, but i think it will be easier to do the whole topic at once
[16:54:04] elukey: ya exactly
[16:54:13] all right will work on it tomorrow then
[16:54:15] :D
[16:54:16] I like the idea
[17:08:45] all right logging off a bit earlier today people!
[17:08:53] will re-check later
[17:08:56] * elukey off!
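Note on the super.users discussion (16:17 to 16:48): the behaviour quoted from the Confluent post can be illustrated with a toy model. The Python sketch below is not Kafka's authorizer; it only encodes the two rules stated in the quote (super users skip ACL checks, and a resource with no ACLs denies everyone else). It ignores wildcards, deny rules and host matching, and the class name `SimpleAuthorizer` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleAuthorizer:
    """Toy model of the quoted default behaviour, not Kafka's implementation."""

    super_users: set = field(default_factory=set)
    # Maps a resource name to a set of (principal, operation) pairs.
    acls: dict = field(default_factory=dict)

    def is_allowed(self, principal: str, operation: str, resource: str) -> bool:
        # Super users are allowed without checking ACLs.
        if principal in self.super_users:
            return True
        entries = self.acls.get(resource)
        # No ACLs on the resource: nobody but super users gets access.
        if not entries:
            return False
        return (principal, operation) in entries


broker = "User:CN=kafka_jumbo-eqiad_broker"
authorizer = SimpleAuthorizer()

# Without ACLs or super user status, the Describe seen in the log is denied.
denied_before = authorizer.is_allowed(broker, "Describe", "Topic:otto1")

# Listing the broker principal as a super user allows it on every resource.
authorizer.super_users.add(broker)
allowed_after = authorizer.is_allowed(broker, "Describe", "Topic:otto1")
print(denied_before, allowed_after)
```

The alternative discussed above, adding explicit ACLs per topic, would correspond to filling `acls` for every resource the broker touches, which is why super.users was the more convenient choice.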
[17:17:50] oh look llegar m
[17:21:09] gone for dinner a-team, back in a while
[17:21:37] bye joal, I'm signing off, byeaaall
[17:38:41] Analytics: R execution on stat1005 -> 'stack smashing error' - https://phabricator.wikimedia.org/T174946#3813666 (Erik_Zachte) Open>Resolved I see recent R charts again! It was an elusive bug, hard to replicate. When I ran from the command line all went well. Also from a bash file with same command...
[17:42:47] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3813690 (mobrovac) Set a deployment window for the migration for [2017-12-06 17:30 UTC](https://wikitech.wikimedia.org/wiki/Deployments#Week_of_Decem...
[17:56:08] elukey ahhhh craziness!
[17:56:26] it seems that with any Current ACLs for resource `Cluster:kafka-cluster`: set at all
[17:56:33] e.g if you ever set any --producer perms for any topic
[17:57:01] it will restrict any cluster operations for any non listed principal!
[17:57:19] took me the last hour to figure out why mirrormaker couldn't produce
[18:02:41] AH because we need ANONYMOUS allowed for some cluster stuff
[18:03:29] Analytics: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3813770 (Nuria) Bought and being updated: https://gerrit.wikimedia.org/r/#/c/395549/
[18:04:07] Analytics: Incorporate data from the GeoIP2 ISP database to webrequest - https://phabricator.wikimedia.org/T167907#3813771 (Nuria) We should plan to add this to webrequest early q3
[18:17:11] ottomata: ping
[18:17:52] So you've deployed my thread safety patch to beta right? What do you think about putting it on kafka1001 in prod and looking at what happens till the evening?
[18:19:11] Pchelolo: into it, in a meeting and finishing up some kafka ssl stuff
[18:19:15] i think we can do that today
[18:19:21] actually, if you wanna go right ahead
[18:19:21] :)
[18:19:39] ye I can deploy on kafka1001
[18:19:56] will do and will monitor the logs and keep you posted
[18:29:08] cool thanks Pchelolo!
[18:29:28] deployed, so far so good
[18:53:18] (PS2) Joal: Update mediawiki-history-reduced job [analytics/refinery] - https://gerrit.wikimedia.org/r/389496 (https://phabricator.wikimedia.org/T178504)
[18:57:50] Quarry, User-revi: Rename Hym411 to -revi - https://phabricator.wikimedia.org/T182064#3813921 (zhuyifei1999) Open>Resolved a:zhuyifei1999 https://quarry.wmflabs.org/-revi
[19:26:37] ottomata: So the eventbus service was running on kafka1001 with my fix for a couple of hours and all looks perfect (no errors)
[19:26:48] What do you think about getting bold and deploying all over?
[19:26:53] +1 :)
[19:27:05] Pchelolo: we will need to enable async in puppet too
[19:27:27] i'll do puppet first, then when you deploy it'll pickup the change
[19:28:20] ottomata: hm ok kafka1001 it's already async=True
[19:28:26] yes
[19:28:29] special cased
[19:28:37] for the test
[19:28:41] oh, ok, gotcha
[19:28:51] ping me when you're done with puppet
[19:35:16] Pchelolo: puppet applied, go for it.
[19:35:24] kk
[19:38:11] ok, deployed. I will be watching it
[19:38:19] oooook
[19:38:23] hopefully that resolves the timeouts
[19:38:25] fingers crossed this helps timeouts
[19:38:26] yeah
[19:38:28] who knows!
[19:38:53] oh Pchelolo i forgot i also wanted to merge this one
[19:38:53] https://gerrit.wikimedia.org/r/#/c/393672/
[19:40:39] looking
[19:43:48] Hi all! Some issues with the pipeline somewhere on the way to Druid? https://goo.gl/jaoJN5 Thx in advance!!! :)
[19:43:51] heh.. I don't think I'm a good reviewer for that patch ottomata as I don't really understand what's happening there
[19:48:45] Pchelolo: its ok, its pretty harmless. a long time ago i had tried to do the json parsing and validation of the event async too, but it doesn't work like i thought it did back then.
[20:23:04] AndyRussG: yes, realtime pipeline will have holes due to some upgrades, once batch jobs catch up they should be filled in
[20:48:57] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Migrate htmlCacheUpdate job to Kafka - https://phabricator.wikimedia.org/T182023#3814268 (Pchelolo) > I think it's ok to use a small wiki to test the functionality without touching the concurrency configs, though. Which wiktionary...
[20:49:24] nuria_: gotcha, thanks much! :D
[21:28:37] Gone for tonight a-team - See you tomorrow
[21:32:00] laters
[21:32:01] !
[23:14:12] Analytics, Discovery, EventBus, Wikidata, and 3 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3814596 (Smalyshev) @Ottomata thanks, I can connect to the hosts above, but still not sure how to control the starting point. I'll try to look around...
[23:14:47] Analytics, Discovery, EventBus, Wikidata, and 4 others: Create reliable change stream for specific wiki - https://phabricator.wikimedia.org/T161731#3814597 (Smalyshev)