[00:56:33] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: SparkR on Spark 2.3.0 - Testing on Large Data Sets - https://phabricator.wikimedia.org/T192348#4156076 (10GoranSMilovanovic) Hi @JAllemandou I think it all needs to go to HDFS first. After starting a SparkR ses...
[01:01:07] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: SparkR on Spark 2.3.0 - Testing on Large Data Sets - https://phabricator.wikimedia.org/T192348#4156084 (10GoranSMilovanovic) 05Open>03Resolved
[01:29:59] HaeB: omg that's insane. I knew they had a big cluster but holy poop
[01:44:35] and they still can't fix that search column in tweetdeck that has been broken for me since like a year
[02:18:53] :) yeah, some things are hard, some bugs are load-bearing
[02:19:35] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4118961 (10Tbayer) > Method 1 has the disadvantage that we would be able to find out username given crossDevi...
[05:48:27] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: SparkR on Spark 2.3.0 - Testing on Large Data Sets - https://phabricator.wikimedia.org/T192348#4156197 (10elukey) >>! In T192348#4138708, @GoranSMilovanovic wrote: > @elukey The ideal situation would be to have...
[06:03:15] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4119336 (10elukey)
[07:09:08] Morning elukey
[07:11:06] o/
[07:11:39] joal: this morning I am reading a bit about NUMA, Erik did great work in T191236
[07:11:40] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236
[07:12:05] all the hadoop workers afaics have two NUMA nodes
[07:13:58] the -XX:+UseNUMA could be a nice and quick test to perform on a couple of analytics worker nodes
[07:23:53] elukey: I've read a bit as well on NUMA, but I don't really know how to test perf
[07:25:26] there are nice tools like numastat that can give some indication about how it is going, I am currently looking if we have prometheus metrics about it
[07:26:04] in theory, since we do run numa hw, it should be reasonable to ask the JVMs to be aware of it
[07:26:42] elukey: agreed on theory :)
[07:27:06] joal: ah by the way, did you see the journalnodes metrics???
[07:27:07] elukey: druid move yesterday went fine I assume?
[07:27:13] I have not elukey !
[07:27:35] new section in https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1!
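(Editor's note: a minimal sketch, not from the log, of the kind of numastat-based check being discussed. It assumes the classic numastat(8) output layout: a header row of node names, then one row per counter such as numa_hit and numa_miss.)

    import subprocess

    def numastat_counters():
        # Parse plain `numastat` output into {counter: [node0, node1, ...]}.
        lines = subprocess.check_output(["numastat"], text=True).splitlines()
        counters = {}
        for line in lines[1:]:  # skip the header row of node names
            parts = line.split()
            if len(parts) >= 2:
                counters[parts[0]] = [int(v) for v in parts[1:]]
        return counters

    c = numastat_counters()
    # Fraction of allocations that missed the preferred node, per NUMA node:
    for node, (hit, miss) in enumerate(zip(c["numa_hit"], c["numa_miss"])):
        total = hit + miss
        print(f"node{node}: numa_miss ratio {miss / total:.4%}" if total else f"node{node}: idle")

A high miss ratio on a two-node box like the hadoop workers is the sort of signal that would motivate the -XX:+UseNUMA experiment mentioned above.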
[07:27:55] about druid: didn't do anything, Dan preferred not to proceed due to the data outage
[07:28:28] and last update from him some hours ago was that something was still ongoing and it was probably best to wait
[07:28:38] but don't have too much context :(
[07:34:21] elukey: I can give context on data outage
[07:34:33] elukey: 2 different problems
[07:35:20] biggest one was data corruption due to me testing the reduced-load job without removing the druid indexation step (Meeeeeehhhhhh - I apologize deeply for that)
[07:36:06] second: the new jar seems not to compute correctly (or not store, not sure) user being anonymous - leading to incorrect slicing and dicing in WKS2
[07:36:46] nuria reindexed the 2018-02 snapshot, allowing for data to be correct if not up-to-date
[07:36:54] That's where we are, I think
[07:37:02] ack!
[07:37:18] do you think that we shouldn't upgrade druid analytics too?
[07:38:00] elukey: I really don't view the issues as related
[07:38:30] But, I don't know how the team prefers us to move, to avoid having too many moving pieces at the same time
[07:38:50] elukey: see T192959 about preventing messing up indexations
[07:38:50] T192959: only hdfs (or authenticated user) should be able to run indexing jobs - https://phabricator.wikimedia.org/T192959
[07:40:24] joal: I think that Dan's point was not to have two different clusters if we need to make comparison etc..
[07:40:46] I am ready to upgrade anytime, already done all the prep work
[07:41:44] elukey: after having made a big mistake this week, I don't feel comfortable taking decisions on this
[07:41:55] elukey: But I'm fully ready to support anytime
[07:42:56] joal: there is no hurry, I think we can skip to next week
[07:44:52] also about the authentication task, http://druid.io/docs/0.11.0/configuration/auth.html
[07:45:06] from 0.11 onwards TLS + HTTP basic auth is available
[07:46:38] not that it will be trivial to integrate this
[07:46:44] yeah
[07:48:40] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959#4156322 (10elukey)
[07:49:04] from the NUMA side, we have an extra prometheus collector for those metrics
[08:41:54] 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10Patch-For-Review, 10User-GoranSMilovanovic: SparkR on Spark 2.3.0 - Testing on Large Data Sets - https://phabricator.wikimedia.org/T192348#4156472 (10GoranSMilovanovic) @elukey Thanks!
[08:50:46] joal: since you insisted a lot, forcing me to work on NUMA this morning :D, I created https://grafana.wikimedia.org/dashboard/db/analytics-numa
[08:50:51] still WIP :)
[08:51:17] we are now collecting metrics for druid and the hadoop workers
[09:05:12] 10Analytics: RStudio web version on SWAP - https://phabricator.wikimedia.org/T180270#4156559 (10GoranSMilovanovic) @mpopov from the URL provided, `Loading repository: yuvipanda/binder-1/upgrade2` takes forever.
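(Editor's note: a hedged sketch of what the 0.11 auth support linked above enables. The overlord host is hypothetical; /druid/indexer/v1/task is the overlord's task-submission endpoint.)

    import requests
    from requests.auth import HTTPBasicAuth

    OVERLORD = "https://druid-overlord.example.org:8290"  # hypothetical host

    def submit_task(task_spec, user, password):
        # With TLS + HTTP basic auth enabled (Druid >= 0.11), an indexing
        # job would authenticate roughly like this:
        resp = requests.post(
            f"{OVERLORD}/druid/indexer/v1/task",
            json=task_spec,
            auth=HTTPBasicAuth(user, password),
        )
        resp.raise_for_status()
        return resp.json()["task"]  # task id assigned by the overlord

This is the mechanism that would let only hdfs (or another authenticated user) submit indexing jobs, per T192959.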
[09:08:20] elukey: I know I shouldn't have insisted, but man, those charts are nive :)
[09:08:35] s/v/c
[09:09:29] :D
[09:09:33] added other ones now
[09:15:59] joal: this one is interesting https://grafana.wikimedia.org/dashboard/db/analytics-numa?panelId=5&fullscreen&orgId=1
[09:17:17] some good info in https://www.systutorials.com/docs/linux/man/8-numastat/
[09:20:52] the diff between druid and analytics hosts is really interesting
[09:45:54] elukey: I'm focusing on the data issue now, but I promise I'll read after
[09:48:09] yep sorry for the distraction :)
[09:57:48] going to reimage two hadoop nodes to stretch
[10:06:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4156759 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1061.eqiad.wmnet', 'an...
[11:10:38] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4156960 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1062.eqiad.wmnet', 'analytics1061.eqiad.wmnet'] ``` and were **ALL** su...
[11:26:33] * elukey lunch!
[11:41:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is CRITICAL: 6.06e+06 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:42:55] just got this nice alert --^
[11:43:42] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_main-codfw&refresh=5m&panelId=5&fullscreen&orgId=1&from=1524656134301&to=1524656389879
[11:43:59] there was a spike but it resolved, maybe the alert is still too sensitive
[11:44:31] it should clear in a couple of mins
[11:48:36] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[11:48:49] good :)
[11:54:11] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10MediaWiki-extensions-Translate, 10Services (done): Unable to mark pages for translation in Meta - https://phabricator.wikimedia.org/T192107#4157107 (10Nikerabbit)
[12:07:12] elukey: Have you restarted oozie for webrequest-upload or shall I do it?
[12:10:23] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.294e+08 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[12:13:49] joal: please go ahead :)
[12:15:02] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 4.692e+07 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[12:18:16] interesting!
[12:18:16] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad&refresh=5m&panelId=5&fullscreen&orgId=1
[12:18:21] big lag
[12:18:26] but now it is auto-resolving
[12:21:07] joal: do we need to restart webrequest-upload? I can see only a stats alerts
[12:21:10] *alert
[12:21:36] (stats == dataloss)
[12:24:00] elukey: You're right !
[12:24:07] elukey: currently running the check
[12:24:38] !log Only false positive for Data Loss Warning - Workflow webrequest-load-check_sequence_statistics-wf-upload-2018-4-25-10
[12:24:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:26:02] \o/
[12:29:11] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Unify, if possible, AQS and Restbase's cassandra dashboards - https://phabricator.wikimedia.org/T193017#4157303 (10elukey) p:05Triage>03Normal
[12:29:33] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[12:30:25] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[12:51:44] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4157398 (10Pchelolo)
[12:51:48] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Enable CP4JQ support for private wikis - https://phabricator.wikimedia.org/T191464#4157392 (10Pchelolo) 05Open>03Resolved a:03Pchelolo Support was enabled for all wikis except wikitech (see T192361 for reasoning). Resolving.
[12:56:39] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Services (done): LocalGlobalUserPageCacheUpdateJob always fails - https://phabricator.wikimedia.org/T192405#4157412 (10Pchelolo) 05Open>03Resolved This has been resolved by enabling EventBus extension on `loginwiki` wiki with T191464
[13:04:25] elukey: for overall view (still WIP) https://grafana-admin.wikimedia.org/dashboard/db/joal-numa?orgId=1&from=now-3h&to=now
[13:06:29] joal: I had the same view but I thought it was a bit confusing to spot patterns
[13:07:07] but let's add everything to one dashboard if possible :)
[13:08:33] also very interesting comment from Erik https://phabricator.wikimedia.org/T191236#4152025
[13:09:05] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.279e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[13:10:40] this one looks a bit more werid
[13:10:44] *weird --^
[13:10:54] kafka1013 started to push a ton of data
[13:10:54] o/ elukey sorry for all the mirrormaker lag spam
[13:10:59] i'm going to increase the threshold
[13:11:03] (oh is something else going on?)
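(Editor's note: the lag alert above boils down to a threshold on a Prometheus query; a sketch of the same check, with illustrative metric and label names — the real check uses whatever lag metric MirrorMaker exports via Prometheus.)

    import requests

    PROM = "http://prometheus.example.org/api/v1/query"  # illustrative host
    QUERY = ('max_over_time(kafka_mirrormaker_max_lag'
             '{mirror_name="main-eqiad_to_jumbo-eqiad"}[10m])')

    result = requests.get(PROM, params={"query": QUERY}).json()["data"]["result"]
    for series in result:
        lag = float(series["value"][1])
        print(series["metric"], lag, "CRITICAL" if lag > 1e5 else "OK")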
[13:11:08] i think job queue is crazy
[13:11:15] sorry just signing on
[13:11:23] ottomata: o/
[13:11:52] in all cases it seemed due to "bursty" behavior
[13:12:10] yeah
[13:12:23] i think the elasticawrite does HUGE messages
[13:12:25] this one instead it is interesting: kafka1013 is indeed consuming more data (or seems so)
[13:12:36] aye, single partition topics
[13:12:48] so i betcha elasticawrite (or whatever) is assigned to that
[13:12:57] no idea what that is :D
[13:13:21] it's a CirrusSearch job
[13:13:21] me neither, i think a cirrussearch job
[13:13:32] :)
[13:13:37] ack :)
[13:13:52] we really don't need to replicate those job or change-prop topics at all
[13:13:53] today me and joseph started to play a bit with NUMA metrics
[13:13:53] since guillaume started to restart the cluster we started to pile up all writes to the job queue
[13:13:55] i just left it as is
[13:14:02] we could disable the main -> analytics stuff too
[13:14:04] yep yep makes sense
[13:14:21] but, we'll eventually want to re-enable main -> jumbo after the main upgrade
[13:16:34] wait i am a bit lost
[13:16:34] worth noting that it's only testwikis (including mw.org) so far (iirc), other wikis including big ones haven't been migrated yet :/
[13:17:21] elukey: so we've blacklisted all job|cp topics from mm instances, except for main-eqiad -> analytics-eqiad
[13:17:36] i didn't blacklist there just because, it was working. the brokers are both 0.9, mm is 0.9
[13:17:40] so, it is actually working
[13:17:45] but the topics are bursty and causing lag
[13:17:45] so
[13:17:51] ahhh okok I forgot this bit, now it makes sense :)
[13:17:52] we could just go ahead and blacklist them from main -> analytics
[13:17:54] since we don't need them
[13:18:04] or we could expand lag alert threshold
[13:18:38] yep
[13:29:37] elukey: not sure if you noticed, but i renamed the kafka prometheus dashes to just kafka etc.
[13:29:42] the old ones (for analytics) are called (graphite)
[13:30:55] yes noticed, super good :)
[13:36:46] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4157565 (10Ottomata) One nit! Remember that your json field names are going to be directly mapped to caseles...
[13:37:45] elukey: are you reimaging any worker nodes right now?
[13:38:01] i haven't done any yet (keep forgetting) and want to do one so I know all the steps and can just fire them off when i have time
[13:39:40] ottomata: nope, did two earlier on but didn't have time for more
[13:39:43] next one is 1060
[13:39:49] I added some steps in the task's descr
[13:39:52] ok, can I do now? lemme see if I know how...
[13:39:53] k
[13:40:05] OH AMAZING
[13:40:07] you have instructions!
[13:40:10] yep :)
[13:40:12] ok going to do 1060
[13:40:29] !log stop camus on an1003 as prep step to gracefully restart hive server
[13:40:32] joal: --^
[13:40:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:41:04] ack elukey
[13:41:37] joal: not sure if you've read my email about the UDF blacklist
[13:41:58] but basically the fix forbids xpath,xpath_string,xpath_boolean,xpath_number,xpath_double,xpath_float,xpath_long,xpath_int,xpath_short
[13:42:29] maybe worth sending an email to analytics@ ?
[13:44:16] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4151252 (10JAllemandou) a:03JAllemandou
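(Editor's note: roughly what "blacklisting all job|cp topics" amounts to; the regex and topic names below are illustrative, not the actual MirrorMaker config.)

    import re

    BLACKLIST = re.compile(r"\.(job|change-prop)\.")

    topics = [
        "eqiad.mediawiki.job.cirrusSearchElasticaWrite",
        "eqiad.mediawiki.revision-create",
        "eqiad.change-prop.retry.mediawiki.job.refreshLinks",
    ]
    print([t for t in topics if not BLACKLIST.search(t)])
    # -> only eqiad.mediawiki.revision-create would still be mirrored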
[13:44:30] 10Analytics, 10CirrusSearch, 10Discovery, 10EventBus, and 5 others: Exception thrown while running DataSender::sendData in cluster codfw: Data should be a Document, a Script or an array containing Documents and/or Scripts - https://phabricator.wikimedia.org/T191024#4157584 (10Pchelolo) I believe the fix fo...
[13:44:37] (03PS1) 10Joal: Correct mediawiki-history job bugs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841)
[13:47:29] 10Analytics, 10CirrusSearch, 10Discovery, 10EventBus, and 5 others: Exception thrown while running DataSender::sendData in cluster codfw: Data should be a Document, a Script or an array containing Documents and/or Scripts - https://phabricator.wikimedia.org/T191024#4157597 (10dcausse) @Pchelolo yes I belie...
[13:51:10] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (10Pchelolo)
[13:51:16] 10Analytics, 10CirrusSearch, 10Discovery, 10EventBus, and 4 others: Exception thrown while running DataSender::sendData in cluster codfw: Data should be a Document, a Script or an array containing Documents and/or Scripts - https://phabricator.wikimedia.org/T191024#4157616 (10Pchelolo) 05Open>03Resolved...
[13:53:39] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4157628 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1060.eqiad.wmnet'] ``` T...
[14:02:53] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157665 (10dcausse) I don't have strong opinions on which wikis we should migrate next. My sole concern right now is regarding write freezes when...
[14:05:36] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157685 (10mobrovac) >>! In T189137#4157665, @dcausse wrote: > I don't have strong opinions on which wikis we should migrate next. group1 could be...
[14:09:09] ottomata: helloooo would you have a minute in the batcavearooni?
[14:09:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[14:09:31] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157701 (10Pchelolo) The subtasks that were created to fix issues discovered during the first iteration of the switch were resolved, and I don't se...
[14:10:08] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Unify, if possible, AQS and Restbase's cassandra dashboards - https://phabricator.wikimedia.org/T193017#4157704 (10elukey) I created https://grafana-admin.wikimedia.org/dashboard/db/cassandra-aqs to port manually all the metrics names and see the discrepancies....
[14:11:29] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157706 (10dcausse) >>! In T189137#4157685, @mobrovac wrote: >>>! In T189137#4157665, @dcausse wrote: >> I don't have strong opinions on which wiki...
[14:13:10] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157711 (10mobrovac) Given the numbers above, going with everything but enwiki, wikidata and commons should be a good next round.
[14:14:12] ottomata: https://gerrit.wikimedia.org/r/#/c/428926/1/templates/wmnet is not ready to merge right?
[14:14:17] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157713 (10Pchelolo) > When we freeze writes we start to push ElasticaWrite jobs that contain the full page doc which can be relatively large. We h...
[14:14:51] elukey: shouldn't hurt, unless you see a problem
[14:14:55] it's tricky entering those things
[14:14:59] lots of room for typos and mistakes
[14:15:44] I'd prefer to merge it when we add the static ipv6
[14:16:18] for example, kafka1001 ip addr shows 2620:0:861:101:1618:77ff:fe33:5242
[14:16:46] now say we add the AAAA record for 2620:0:861:101:10:64:0:11
[14:16:56] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4032605 (10Ottomata) I already feel like 4Mb messages are a lot, and would much prefer not to increase the max message size more. Can these jobs b...
[14:17:03] and some client/broker/whatever picks it up
[14:17:19] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157722 (10dcausse) >>! In T189137#4157713, @Pchelolo wrote: >> When we freeze writes we start to push ElasticaWrite jobs that contain the full pag...
[14:17:32] oh, hm, i had thought the dns would be harmless
[14:17:43] was planning on merging the dns first, and then adding mapped ipv6 when we reimage to stretch
[14:18:10] we can do the other way around no?
[14:18:12] sure
[14:18:15] super
[14:18:19] elukey: can we just add the ipv6s now?
[14:18:23] before rebooting?
[14:18:32] the mapped ipv6
[14:18:41] just ensure in puppet, and it will then do it?
[14:19:07] before rebooting? are you reimaging them now?
[14:19:14] nonoo
[14:19:14] (sorry didn't follow)
[14:19:15] sorry
[14:19:16] ha
[14:19:28] i mean, can/should we add the mapped ipv6 now
[14:19:29] ahead of time
[14:19:34] ahhhhh!
[14:19:36] guess why not, eh?
[14:19:37] yes i think so
[14:19:43] ok, let's do that first
[14:19:46] and if they all get the right IPs
[14:19:48] then we merge DNS?
[14:19:52] maybe a super safe: kafka stop, apply puppet, kafka start?
[14:19:58] hm
[14:19:59] yeah
[14:20:10] just to make sure existing ipv6 connections don't get wonky?
[14:20:16] exactly
[14:20:19] ya k
[14:20:33] let's sync with services first
[14:20:36] maybe we can start with codfw
[14:22:56] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4157727 (10Ottomata)
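(Editor's note: the "mapped" IPv6 convention discussed above writes each decimal IPv4 octet verbatim as an IPv6 group; a small sketch using the kafka1001 addresses from the log. The actual work is done by interface::add_ip6_mapped in puppet — the Python is only to illustrate the mapping.)

    import ipaddress

    def mapped_v6(prefix, v4):
        # 10.64.0.11 under 2620:0:861:101:: -> 2620:0:861:101:10:64:0:11
        addr = prefix.rstrip(":") + ":" + v4.replace(".", ":")
        return str(ipaddress.IPv6Address(addr))  # validates the result

    assert mapped_v6("2620:0:861:101::", "10.64.0.11") == "2620:0:861:101:10:64:0:11"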
[14:26:22] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157730 (10Gehel) >>! In T189137#4157706, @dcausse wrote: >>>! In T189137#4157685, @mobrovac wrote: >>>>! In T189137#4157665, @dcausse wrote: >>> M...
[14:26:43] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157731 (10Pchelolo) > If there is a way to monitor such errors I guess we can pick-up known large pages and modify them while the write are frozen...
[14:27:43] elukey: putting add_ip6_mapped in site.pp violates wmf-style?
[14:27:48] wmf-style: node 'kafka[12]00[123]\.(eqiad|codfw)\.wmnet' declares interface::add_ip6_mapped
[14:28:24] in theory it shouldn't, it was whitelisted
[14:28:33] https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/20166/console
[14:28:49] but the new gem might not be updated
[14:28:56] so jenkins still complains
[14:31:52] k
[14:32:00] will say "SHUT UP JENKINS"
[14:32:11] exactly :D
[14:34:06] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4157755 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1060.eqiad.wmnet'] ``` and were **ALL** successful.
[14:37:06] !log restart hive-server2 on analytics1003 to pick up settings in https://gerrit.wikimedia.org/r/428919
[14:37:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:39:48] !log re-enable camus after maintenance
[14:39:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:43:11] (03CR) 10Mforns: Correct mediawiki-history job bugs (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[14:43:39] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4157778 (10elukey) a:03elukey
[14:43:46] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Report updater setting log ownership incorrectly (leading to cronspam) - https://phabricator.wikimedia.org/T191871#4119336 (10elukey)
[14:44:23] * elukey coffeee!
[14:47:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.204e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[14:48:01] alright alright
[14:48:05] lowering lag threshold
[14:50:50] oo i'll just blacklist the job and cp topics from lag check alerts
[14:54:50] (03CR) 10Joal: "Comment inline, Thanks @mforns :)" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[14:58:15] (03CR) 10Mforns: [C: 032] "Sorry for not spotting those in earlier reviews!" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[14:58:31] elukey: hm, I'm looking at ipv6 connections on kafka[12]001
[14:58:37] i only see things like ::ffff:10.192.0.139:9092
[15:03:14] there are no (real) ipv6 connections
[15:03:14] so i'm pretty sure just adding the mapped IPv6 will be fine
[15:03:14] no need to stop brokers
[15:03:14] should be yes
[15:03:15] let's try with 2001 then and see
[15:03:17] k
[15:03:47] (03Merged) 10jenkins-bot: Correct mediawiki-history job bugs [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[15:09:26] elukey: other grafana thing: have you had problems with the default all selector in prometheus template variables?
[15:09:56] what kind of problems?
[15:10:08] I usually set the custom value .*
[15:10:24] so those tons of pings yesterday
[15:10:29] happened because I had .* as the selector
[15:10:31] but all that does
[15:10:38] is replace the variable, e.g. $kafka_brokers
[15:10:45] in prometheus metric queries
[15:10:46] with .*
[15:11:00] so, i was using the $kafka_brokers to e.g. query for network stats
[15:11:05] but EVERY node has a network stat
[15:11:06] so
[15:11:10] yeah
[15:11:14] instance=~"$kafka_brokers.*"
[15:11:16] selected all nodes
[15:11:18] so
[15:11:24] you'd think you could just leave the default All value
[15:11:26] without using .*
[15:11:43] but, that ends up only selecting one of the nodes in the template list if you select ALL
[15:11:50] which seems like a bug
[15:12:22] my workaround was to keep .* as custom all value
[15:12:29] but also add the cluster=$cluster param
[15:12:41] i made $cluster a hidden template var
[15:12:49] that gets selected automatically based on $kafka_cluster
[15:13:24] in grafana you mean?
[15:15:24] ya
[15:15:29] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4157931 (10Ottomata) I don't love it! I feel like 4Mb is already huge. Consider troubleshooting some problem with `kafkacat -C | jq .`. Gotta co...
[15:16:02] ottomata: I'm confused by the existence of both wmf and master branches here
[15:16:09] wmf is the most recently updated
[15:16:15] https://github.com/wikimedia/analytics-ua-parser-uap-java
[15:16:18] fdans: in what?
[15:16:21] oh uap
[15:16:23] looking
[15:16:31] but the ua-parser repo points at a commit in the master branch
[15:16:43] oh hm
[15:16:46] that i don't know
[15:16:48] i would guess:
[15:16:52] master is supposed to track upstream
[15:17:02] and wmf has our changes
[15:17:09] so when updating i guess we'd do
[15:17:24] git pull upstream/master into master
[15:17:28] git checkout wmf
[15:17:30] Thanks for the merge mforns :)
[15:17:30] git merge master
[15:17:37] < make edits to pom>
[15:17:38] commit
[15:17:48] joal, np :]
[15:18:20] fdans: dunno why uap-parser would point to master though
[15:18:45] hmm
[15:18:46] fdans: perhaps
[15:18:55] someone just didn't update to point at wmf after https://github.com/wikimedia/analytics-ua-parser/commit/13c20e949bbe0501350e118133d4cd222bc4721c ?
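(Editor's note: a sketch of the template-variable problem described above — with "All" defined as the custom value .*, the instance matcher alone matches every host, which is why the hidden $cluster variable is what actually scopes the query. The metric and label names are illustrative.)

    QUERY = 'node_network_receive_bytes_total{{instance=~"{brokers}.*", cluster="{cluster}"}}'

    # With All => .*, the instance regex matches all nodes:
    print(QUERY.format(brokers=".*", cluster="kafka_jumbo-eqiad"))
    # node_network_receive_bytes_total{instance=~".*.*", cluster="kafka_jumbo-eqiad"}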
[15:18:58] because
[15:19:07] uap-parser doesn't do anything except contain the submodules
[15:19:08] so
[15:19:20] to update, one could edit the wmf branch in local uap-java
[15:19:25] update ../uap-core
[15:19:28] and just build and upload to archiva
[15:19:31] without making any commits
[15:19:40] maybe that's what happened?
[15:20:05] right
[15:21:52] so speaking of uploading to archiva ottomata, do i need to do anything besides changing the version to 1.3.1-wmf6-SNAPSHOT and doing mvn deploy?
[15:22:26] I mean, does maven package the regexes by getting them from ../uap-core?
[15:22:32] 10Analytics, 10ChangeProp, 10EventBus, 10MediaWiki-JobQueue, and 5 others: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137#4158003 (10Pchelolo) > Consider troubleshooting some problem with kafkacat -C | jq . Haha :) > That said, I'm not opposed, as I don't know of any...
[15:28:54] ottomata: ping (sorry!)
[15:29:21] fdans: sorry! hm, i don't know if mvn deploy is set up for uap repos
[15:29:29] i think you have to manually upload...unless you want to set up mvn deploy :D
[15:29:56] probably wouldn't be too hard
[15:30:04] just making the uap-java pom have some stuff that refinery-source does
[15:30:23] https://wikitech.wikimedia.org/wiki/Archiva#Deploy_to_Archiva
[15:30:52] ottomata: yeah i added the authentication stuff to my ~/.m2/settings.xml
[15:31:36] ok cool, yeah probably adding that distributionManagement stuff to our wmf branch in uap-java would be cool
[15:32:57] ottomata: the only thing is that the stuff in pom.xml is 1.3.1-SNAPSHOT
[15:33:01] no mention of the wmf stuff
[15:33:35] ottomata: ops sync??
[15:33:46] OO coming
[15:42:24] 10Analytics, 10Analytics-General-or-Unknown, 10Mobile-Apps, 10Wikimedia-Interwiki-links, and 3 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#4158174 (10Pginer-WMF)
[15:43:56] 10Analytics, 10Analytics-General-or-Unknown, 10Mobile-Apps, 10Wikimedia-Interwiki-links, and 3 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#4158177 (10Arrbee) a:03Amire80
[15:48:37] 10Analytics, 10Analytics-General-or-Unknown, 10Wikimedia-Interwiki-links, 10Wikipedia-Android-App-Backlog, and 2 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#4158199 (10Amire80) >>! In T78351#4081407, @Milimetr...
[16:15:13] (03CR) 10Nuria: "Could we possibly add unit tests to this fix?" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428922 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[16:19:24] ottomata: I'm around now if you want to troubleshoot
[16:22:05] neilpquinn: got meetings for next 40 mins, then free
[16:22:12] actually i might need to get some lunch there too
[16:22:18] in 1h40 mins ok?
[16:22:27] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4158432 (10mpopov) >>! In T191859#4156112, @Tbayer wrote: >> Method 1 has the disadvantage that we would be a...
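(Editor's note: illustrative only — a Maven repository is a plain HTTP tree, so once mvn deploy is wired up you can check whether a snapshot already landed. The repository path and coordinates below are assumptions, not the actual Archiva layout.)

    import requests

    URL = ("https://archiva.wikimedia.org/repository/snapshots/"
           "ua_parser/ua-parser/1.3.1-wmf6-SNAPSHOT/maven-metadata.xml")
    print("already deployed" if requests.head(URL).ok else "not deployed yet")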
[16:23:09] 10Analytics, 10Product-Analytics, 10Reading List Service, 10Reading-Infrastructure-Team-Backlog, and 3 others: [EPIC] Reading List Sync service analytics - https://phabricator.wikimedia.org/T191859#4158436 (10mpopov)
[16:37:44] joal: any pointers to where the oozie email comes from? Like how that 3.8 million number is computed in the first place?
[16:41:43] hm, on the last hour of the month that script is gonna freak out, but hopefully we remember
[16:42:19] milimetric: it comes from the oozie/webrequest/load folder
[16:42:48] I think possibly check_sequence_statistics_workflow.xml
[16:43:23] thanks, will look
[16:44:03] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 3 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4158584 (10mobrovac)
[16:54:09] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[16:59:19] fdans: want to talk about ua parser?
[16:59:34] nuria_: we're in batcave-2!
[16:59:39] fdans: ok
[17:05:20] joal: just updated all AQS cassandra dashboards in grafana :)
[17:05:33] look at https://grafana.wikimedia.org/dashboard/db/aqs-cassadra?orgId=1
[17:05:48] credits to Eric for the awesome work
[17:11:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Unify, if possible, AQS and Restbase's cassandra dashboards - https://phabricator.wikimedia.org/T193017#4158678 (10elukey) Cloned all the Cassandra dashboards created by the Services team, and adapted to AQS. The only change that I had to...
[17:47:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 1.003e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[17:48:01] hm why alerting...
[17:52:33] oops blacklisted the topics on the wrong mm instance!
[17:52:51] fdans: sorry meetings interrupted our convos before
[17:53:09] fdans: the version stuff in the wmf branch?
[17:53:13] dunno. feel free to add it
[17:53:14] :)
[17:53:15] ottomata: don't worry! we had an investigative marathon in the cave :)
[17:58:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_eqiad
[18:01:50] joal: can we talk about deleting 2018-03 bad segment from druid?
[18:02:32] nuria_: yes
[18:02:35] nuria_: cave?
[18:02:43] * elukey off!
[18:02:44] joal: i am in meeting
[18:02:51] joal: can talk in 30 mins
[18:03:00] ok
[18:03:44] nuria_: I can do it if you want, but I'm assuming you wish to discuss an action, right?
[18:04:02] joal: yes
[18:28:10] Amir1: o/. Feeling mergey today?
[18:28:17] I want to get this on beta, https://phabricator.wikimedia.org/T192917
[18:28:44] awight: yeah, just ores in beta is completely dead
[18:28:50] oof
[18:29:00] better new code than none, in that case
[18:29:57] U know if it's anything exotic that went wrong with ores-beta, or maybe I just left it borken after my last deployments?
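(Editor's note: a toy version of the check_sequence_statistics logic mentioned at [16:42:48] — per cache host, the request count implied by the min/max sequence numbers is compared against the count actually received. Field names and hostnames are illustrative.)

    from collections import defaultdict

    def sequence_stats(records):
        # records: iterable of (hostname, sequence_number) for one hour
        by_host = defaultdict(list)
        for host, seq in records:
            by_host[host].append(seq)
        for host, seqs in sorted(by_host.items()):
            expected = max(seqs) - min(seqs) + 1
            actual = len(seqs)
            yield host, expected, actual, 100.0 * (expected - actual) / expected

    for host, exp, act, loss in sequence_stats(
            [("cp1008", 1), ("cp1008", 2), ("cp1008", 4), ("cp1008", 5)]):
        print(f"{host}: expected {exp}, got {act}, loss {loss:.1f}%")
    # cp1008: expected 5, got 4, loss 20.0%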
[18:30:59] I think it has been broken for a while now
[18:31:33] awight: both merged
[18:31:39] :) thanks!
[18:35:07] awight: keep me posted, cause I need it for testing wp10 storage in mediawiki
[18:35:18] cool—I'm pushing 2.2.2 now
[18:36:57] Amir1: https://gerrit.wikimedia.org/r/#/c/427015/
[18:37:01] there will be one more...
[18:37:29] merged :D
[18:38:16] I forgot to blink
[18:38:28] Should be the last one, https://gerrit.wikimedia.org/r/428975
[18:47:39] (03PS1) 10Joal: Add unittest for a Mediawiki-History function already fixed [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428977 (https://phabricator.wikimedia.org/T192841)
[18:49:07] nuria_: druid talk?
[18:49:12] joal: yes
[18:54:24] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4159071 (10Ottomata)
[19:08:06] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4159116 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1058.eqiad.wmnet'] ``` T...
[19:08:09] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4159117 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1059.eqiad.wmnet'] ``` T...
[19:10:27] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Wikistats Bug: all but 2018 data missing? - https://phabricator.wikimedia.org/T192841#4159119 (10Nuria) Data up to 2018-02 is now 2018-02 snapshot, removed last segment 2018-03
[19:20:37] 10Analytics, 10Analytics-EventLogging, 10Need-volunteer: Add Composer support - https://phabricator.wikimedia.org/T60459#4159168 (10Umherirrender) 05Open>03declined Per T467
[19:23:43] (03CR) 10Mforns: [C: 032] "LGTM!" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428977 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[19:29:30] (03Merged) 10jenkins-bot: Add unittest for a Mediawiki-History function already fixed [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428977 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[19:45:04] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4159240 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1058.eqiad.wmnet'] ``` and were **ALL** successful.
[19:45:46] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4159242 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1059.eqiad.wmnet'] ``` and were **ALL** successful.
[20:09:12] 10Analytics, 10EventBus, 10Services: Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4159283 (10Ottomata)
[20:09:22] 10Analytics, 10EventBus, 10Services: Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4159294 (10Ottomata)
[20:11:51] neilpquinn: yt?
[20:19:05] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4159316 (10Ottomata) Ok, I think we are ready to do this! If there are no objections, I'll start on codfw tomorrow.
[21:23:31] (03CR) 10Nuria: [C: 031] "I undid the changes here and tests fail which is what we would expect :https://gerrit.wikimedia.org/r/#/c/428922/1/refinery-job/src/main/" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/428977 (https://phabricator.wikimedia.org/T192841) (owner: 10Joal)
[22:06:20] ottomata: yt? If yes can you do me a favor and kill a sparklyr app running on Yarn for me, please?
[22:13:30] Ok, whoever can kill a Spark app running on Yarn, please do it: https://yarn.wikimedia.org/cluster/app/application_1523429574968_52985 - reason: testing {sparklyr}, exiting R without disconnecting from the cluster first == cannot access the app through Yarn anymore (SparkR does not behave that way). Thanks.
[22:26:07] 10Analytics, 10Discovery-Analysis, 10Product-Analytics: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#4159654 (10GoranSMilovanovic) @Ottomata @mpopov Good news: {sparklyr} can connect to Yarn (tested from stat1005, documenting very soon on [[ https://wikitech.wikimedia.org...
[23:15:45] 10Analytics, 10Analytics-Kanban: Update UA parser - https://phabricator.wikimedia.org/T189230#4159736 (10Nuria)
[23:20:28] Speaking of parsing UA, nuria_ can you please point me to the regex file used when refining web requests (I need to look at it for something very urgent and important)
[23:22:51] bearloga: current one is here: https://github.com/wikimedia/analytics-ua-parser-uap-core/commit/095b648d9350854e05d442137326d73c3a401882#diff-a49fec7741e4d88d338f7d2f28c20dda
[23:22:59] nuria_: thank you very much!
[23:23:08] bearloga: note the commit, the one refinery currently uses is not the latest commit on the file
[23:23:17] bearloga: but rather the one i sent you
[23:24:01] nuria_: thank you for the heads-up!
[23:26:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Pull wmf ua-parser-core from the upstream repo - https://phabricator.wikimedia.org/T192465#4159763 (10Nuria) Changes so tests pass are needed in OSParser.java. See specification, replacements can be regex backreferences. cc @mforns and @fdans Trying to pu...
[23:36:01] nuria_: hey so I'm getting errors trying to read that file into Python "yaml.scanner.ScannerError: mapping values are not allowed here in "regexes.yaml", line 315, column 24" (I get same error in R) so you maaaaaaayyyyyyy want to check if Java and UA parser in the refinery can read the whole thing without problems
[23:36:58] https://github.com/wikimedia/analytics-ua-parser-uap-core/blob/095b648d9350854e05d442137326d73c3a401882/regexes.yaml specifically
[23:38:26] nuria_: NEVERMIND SORRY
[23:40:04] nuria_: so sorry to alarm you, I forgot to get the raw file from github
[23:40:35] * bearloga feels so embarrassed right now
[23:56:05] :) aw, 'sall good!
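(Editor's note: two small sketches for the record. First, the kill requested at [22:13:30] amounts to a call to the standard YARN ResourceManager REST API (Cluster Application State); the RM host below is illustrative.)

    import requests

    RM = "http://resourcemanager.example.org:8088"  # illustrative host
    APP = "application_1523429574968_52985"

    resp = requests.put(f"{RM}/ws/v1/cluster/apps/{APP}/state",
                        json={"state": "KILLED"})
    resp.raise_for_status()

(Second, the ScannerError at [23:36:01] came from fetching the GitHub HTML page rather than the raw file — the raw regexes.yaml at the pinned commit parses cleanly.)

    import requests
    import yaml

    RAW = ("https://raw.githubusercontent.com/wikimedia/analytics-ua-parser-uap-core/"
           "095b648d9350854e05d442137326d73c3a401882/regexes.yaml")

    regexes = yaml.safe_load(requests.get(RAW).text)
    print({k: len(v) for k, v in regexes.items()})
    # counts of user_agent_parsers, os_parsers, device_parsers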