[00:53:41] (PS7) Bearloga: Functions for categorizing queries. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) [10:03:19] * elukey is puzzled about smartctl on kafka1012 [11:26:42] I updated https://phabricator.wikimedia.org/T125199 with some findings, really weird [12:18:35] hey a-team! [12:18:48] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997521 (Aklapper) NEW [12:19:14] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997527 (Aklapper) >>! In T64221#1785541, @Aklapper wrote: >>>! In T64221#1781654, @Qgil wrote: >> These two empty se... [12:19:50] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997521 (Aklapper) [12:19:52] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997528 (Aklapper) [12:20:46] mforns o/ [12:22:22] hi elukey how is it going? [12:23:06] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997531 (Aklapper) Hmm. Reloading, it now works as expected again and sorted correctly. Might be some race condition or such. [12:29:11] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997545 (Aklapper) I think we can close this task as "good enough" once 1) **Input welcome:** we have decided whethe... [12:29:53] hey mforns :) [12:43:24] joal, hi! 
[12:50:46] mforns: goooood! I am still investigating the weird disk issue in kafka1012, something is still not clear to me [12:50:49] joal: o/ [12:51:00] \o elukey [14:46:49] good morning a-team! [14:46:54] joal: how dem dar jobs lookin? [14:46:57] hi ottomata :] [14:47:03] hey ottomata [14:47:25] Cluster still very busy, uniques backfilling is for something [14:48:02] hm, aye k, the data move finished? [14:48:03] And some appSessionMetrics failed, but I think it's because of resource constraints, I'll relaunch them when it'll be quieter [14:48:07] k [14:48:20] it finished yesterday night, and I started everything [14:48:36] Only breakage so far is AppsSessionMetricsGlobal [14:48:54] And uniques monthly still running ( but it could have been expected) [14:51:11] K COOL [14:51:13] WHOA CAPS [14:53:18] ottomata: Since it's still babysitting day, let's go for the cassandra one [14:53:26] hey yall, moritzm wants to reboot eventlog1001 [14:53:32] i'm going to stop eventlogging and let him do so [14:53:41] k [14:53:42] s'ok a-team? [14:53:45] mforns: ok? [14:54:01] ok ottomata [14:54:18] sok with me [14:57:28] !log stopping eventlogging to reboot eventlog1001 for kernel update [14:59:35] joal: cassandra one? [14:59:40] yessir [15:00:00] Ummmm not remembering [15:00:12] resetbase, from gwicke [15:00:20] oH [15:00:21] right [15:00:22] haha [15:00:24] was thikning oozie job [15:00:25] thanks! [15:00:26] yes [15:04:16] joal: applied [15:04:19] can you restart restbase? [15:04:30] k ottomata, then we need to wait for a puppet run, correct ? [15:05:01] already done [15:05:08] great ottomata [15:05:24] Then, only thing is to restart restbase I think [15:05:24] hm one sec [15:05:27] i think 1001 doesn't have it applied [15:05:31] k [15:05:37] cmoon puppet [15:05:50] thar we go [15:05:53] ok good to restart [15:06:09] ottomata: one by one please :) [15:06:18] And, please wait a minute [15:06:22] ottomata: --^ [15:06:44] ottomata: GO ! 
[15:06:50] uhhh [15:06:55] do i just do service restbase restart? [15:07:07] hm, I think so [15:07:08] joal: ? [15:07:10] haha [15:07:10] k [15:07:12] milimetric: confirm? [15:07:18] I don't know if cassandra needs a restart [15:07:35] or if the read local applies in restbase only [15:09:39] ha, me neither! [15:09:44] would like to have some more info [15:09:49] ottomata: currently reading on the topic [15:09:50] gwicke: only mentioned restbase restart yesterday [15:09:55] hm, i will ask urandom [15:10:09] then restbase restart, that is good (and what I'd have expected) [15:12:34] joal: I've been reading up I'm not seeing how this relates to restbase [15:12:35] ottomata: restbase restart is sure, my wonder is about cassandra [15:12:42] Analytics-EventLogging, DBA: Potentially decrease db1046's InnoDB buffer pool - https://phabricator.wikimedia.org/T125829#1998057 (jcrespo) NEW a:jcrespo [15:13:10] joal: urandom confirms, doing restart one at a time [15:13:22] awesome :) [15:13:24] no idea what you guys are mumbling about [15:13:29] :) [15:13:39] !log restarting aqs restbase 1 node at a time [15:13:45] Lowering read consistency level of restbase on cassandra milimetric [15:14:05] looks ok [15:14:10] how long should i wait before I do the next one? [15:14:11] * milimetric is even more lost now [15:14:15] huhu [15:14:25] * milimetric feels like he missed some IRC messages [15:14:36] ottomata: a few seconds should be ok I think [15:15:01] ottomata: currently monitoring latency charts [15:15:55] Charts are not really up-to-date, so difficult to follow :) [15:17:13] k just restarted 1002 [15:17:21] looks ok [15:18:07] ottomata: it's quiet time for them (just disk seek), so should be ok [15:19:01] ok, just restarted on 1003 [15:19:08] also looks fine [15:19:09] cool [15:19:15] great [15:23:43] oh by the way ottomata : I tested hive MSCK REPAIR yesterday: works like a charm to restore partitions in metastore from folder hierarchy !
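(Editor's note: Hive's `MSCK REPAIR TABLE` works by scanning the table's directory tree for Hive-style `key=value` partition folders and registering any that are missing from the metastore — which is why it only works "as long as it has the partition keys in the folder names". A rough, illustrative Python sketch of just the discovery step; the directory layout is hypothetical, and this is not Hive's actual implementation:)

```python
import os

def discover_partitions(table_root):
    """Walk a Hive table directory and collect key=value partition specs,
    roughly what MSCK REPAIR TABLE does before adding them to the metastore."""
    partitions = []
    for dirpath, dirnames, _filenames in os.walk(table_root):
        rel = os.path.relpath(dirpath, table_root)
        if rel == ".":
            continue
        parts = rel.split(os.sep)
        # Only leaf directories where every path component looks like key=value
        if not dirnames and all("=" in p for p in parts):
            partitions.append(dict(p.split("=", 1) for p in parts))
    return partitions
```

So after moving files into `.../year=2016/month=02/day=04/` under the table location, a single `MSCK REPAIR TABLE` restores all the partitions at once instead of one `ALTER TABLE ... ADD PARTITION` per folder.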
[15:25:40] milimetric, is there any existing dashiki config wiki page that uses the textual format? I'm reviewing your patch [15:25:42] awesome! for what table? [15:25:50] as long as it has the partition keys in the folder names, ja? [15:25:55] ottomata: mmmm so what do we need to do to restart EL hosts? anthing that might go in https://wikitech.wikimedia.org/wiki/Service_restarts ? [15:26:00] mforns: yes, I just changed Config:TestTabs to have it [15:26:06] cool thx! [15:26:17] aqs hourly ottomata : I had a table in my db, therefore created one in wmf db, moved the files, and MSCK REPAIR --> done in minutes ! [15:26:25] mforns: also though the default config has it, if you navigate to src/layouts/tabs/ [15:26:37] I see milimetric [15:26:45] elukey: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Restart_EventLogging [15:26:47] mforns: the build part of this layout is broken still, it's very painful [15:26:51] although, i'm going to edit that [15:26:53] i don't think restart works [15:27:09] edited. [15:27:11] stop && start [15:27:11] milimetric, it worked for me... [15:27:21] it doesn't copy the new fonts from semantic2 [15:27:30] I see [15:27:33] I was trying to fix it last night and I'm still trying [15:27:48] elukey: if you edit service_restarts, maybe just link there, because it is probably going to change if we upgrade to jessie and use systemd [15:28:02] haha [15:28:04] its here twice already! [15:28:05] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Restart_EventLogging_2 [15:28:18] ha, did I add that? [15:28:44] I consolidated the page a while ago from some snippets found here and there [15:28:52] nice [15:28:53] I might have done it without knowing it [15:29:19] I am always scared when we talk about stopping EL [15:29:34] is there any risk of loosing data or is it completely fine? 
[15:29:40] just to know what NOT to do [15:30:35] ja, there could be a little loss [15:30:42] but mostly not [15:30:57] usually queued messages will be inserted into mysql before it fully stops [15:30:59] (PS3) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [15:31:07] and the next time it starts, it picks up from previous offsets [15:31:08] mforns: ok, new patch up, all works now [15:31:12] which causes more duplicates than loss [15:31:24] * mforns looks [15:33:24] mforns: if you set the visualizer type of the Sessions metric to table-timeseries you'll see it handles that data being HUGE by setting the filter to apply only when you press enter [15:34:04] milimetric, aha [15:38:35] Analytics-Tech-community-metrics, pywikibot-core, DevRel-February-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1998117 (Lcanasdiaz) Half of the work is done. I'm going to work on the visualization of t... [15:40:24] ahhh yes the batch inserts [15:48:01] (PS4) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [15:48:14] mforns: I forgot I had made some modifications to the timeseries data core, so I added some tests ^ [15:48:28] milimetric, awesome [16:03:12] Analytics-Kanban, RESTBase: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#1998202 (Milimetric) [16:03:43] joal: whenever you have time can we chat about https://phabricator.wikimedia.org/T125199 ? [16:04:04] (brb) [16:08:10] Hey elukey [16:08:34] I'm ignorant in SMART (which makes sense) [16:08:48] We can chat if you want but I don't think I can help [16:09:16] yessss I just wanted to talk about the next steps [16:09:26] do you have a minute in the batcave? 
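(Editor's note: the "more duplicates than loss" behavior described above is standard at-least-once consumption: offsets are committed only after a batch is inserted, so a crash between insert and commit replays that batch on restart. A toy model, not the actual eventlogging mysql consumer code — names and batch size are illustrative:)

```python
def consume(messages, committed_offset, batch_size=3):
    """Consume from the last committed offset, 'inserting' each batch and
    only then advancing the offset. A crash after insert but before commit
    means the batch is replayed on restart -> duplicates, not loss."""
    inserted = []
    offset = committed_offset
    while offset < len(messages):
        batch = messages[offset:offset + batch_size]
        inserted.extend(batch)   # e.g. MySQL batch insert
        offset += len(batch)     # offset commit happens after the insert
    return inserted, offset
```

For example, if a run inserted everything but died before committing past offset 3, the restarted consumer re-reads from 3 and re-inserts those rows, which is why duplicates show up downstream rather than gaps.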
[16:09:31] sure [16:19:20] Analytics-Tech-community-metrics, pywikibot-core, DevRel-February-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1998274 (Lcanasdiaz) Dropdown widget fixed. Ready to be deployed. https://github.com/VizG... [16:20:56] (CR) Mforns: [C: -1] "Everything LGTM, but somehow I can not download the larger datasets. See comments. If you think this is not too important, I'm OK with mer" (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) (owner: Milimetric) [16:32:55] Analytics-Kanban: Rotate kafka GC logs [3 pts] {hawk} - https://phabricator.wikimedia.org/T124644#1998318 (Nuria) Open>Resolved [16:33:11] Analytics-Kanban: Debug wikimetrics docker dev setup failing on ubuntu 14.04 - https://phabricator.wikimedia.org/T125415#1998320 (Nuria) Open>Resolved [16:34:31] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1998324 (Nuria) Open>Resolved [16:35:08] Analytics-Kanban, Editing-Analysis: Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#1998326 (Nuria) Also, please see: https://phabricator.wikimedia.org/T124383 When issues with size are fixed these jobs can... 
[16:36:26] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1998327 (Nuria) Open>Resolved [16:37:40] Analytics-Kanban, Patch-For-Review: Create a dedicated hive table with pageview API only requests for reporting [5 pts] {melc} - https://phabricator.wikimedia.org/T118938#1998329 (Nuria) Open>Resolved [16:38:06] Analytics-Kanban, Patch-For-Review: Create a dedicated hive table with pageview API only requests for reporting [5 pts] {melc} - https://phabricator.wikimedia.org/T118938#1813225 (Nuria) Resolved>Open [16:56:59] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#1998395 (Sadads) @Legoktm: did the analysis help? I think adding the feature desirable but not... [17:00:34] a-team joining standup in a second. [17:00:36] a-team: standduppp [17:01:04] coming! [17:01:43] ottomata: holaaaa [17:02:42] EEE [17:24:09] (PS5) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [18:02:19] Analytics-Kanban: Remove cron on wikimetrics instance that updates vital signs [1 pts] - https://phabricator.wikimedia.org/T125751#1998780 (Milimetric) [18:03:16] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Create and maintain an Analytics Cluster in Beta Cluster in labs. [21 pts] - https://phabricator.wikimedia.org/T109859#1998782 (Ottomata) [18:03:50] Analytics-Cluster, Analytics-Kanban, EventBus, Patch-For-Review: Camus job to import mediawiki.* eventbus data to Hadoop. 
[8 pts] - https://phabricator.wikimedia.org/T125144#1998783 (Ottomata) [18:04:37] Analytics, Analytics-Cluster, hardware-requests, operations: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1998793 (Milimetric) [18:05:15] Analytics, hardware-requests, operations, Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1998797 (Milimetric) [18:06:23] Analytics-Kanban: Bookmark-able graphs in Dashiki tabular layout [3 pts] {lama} - https://phabricator.wikimedia.org/T124298#1998807 (Milimetric) [18:07:00] Analytics: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1998809 (Milimetric) [18:07:17] Analytics: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1998811 (Milimetric) p:Triage>Normal [18:07:43] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer {oryx} - https://phabricator.wikimedia.org/T125225#1998813 (Milimetric) [18:08:20] Analytics: Add regexps that match the bots that follow the User-Agent policy {hawk} - https://phabricator.wikimedia.org/T125731#1998819 (Milimetric) p:Triage>Normal [18:08:43] Analytics: Add regexps that match the bots that follow the User-Agent policy {hawk} - https://phabricator.wikimedia.org/T125731#1996003 (Milimetric) research whether the new expressions that we're adding have a match on the cluster [18:09:34] Analytics, Analytics-Cluster, operations: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998824 (Milimetric) Open>stalled No actionables. 
If we install we'll use RAID [18:09:59] Analytics, Analytics-Cluster, operations: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998826 (Milimetric) stalled>Resolved a:Milimetric [18:10:50] Analytics: Data integrity on Analytics Kafka nodes - https://phabricator.wikimedia.org/T125650#1998840 (Milimetric) Open>Resolved Kind of the same as T99105, if we move to RAID that'll probably solve this [18:11:16] Analytics-Kanban, Wikipedia-Android-App: Database not updated for beta event logging and all-events.log reports 8x for each event [3 pts] - https://phabricator.wikimedia.org/T125423#1998848 (Milimetric) [18:11:27] Analytics: Upgrade Dashiki to semantic-2 for all layouts - https://phabricator.wikimedia.org/T125409#1998851 (Milimetric) [18:11:44] Analytics: Upgrade Dashiki to semantic-2 for all layouts - https://phabricator.wikimedia.org/T125409#1986826 (Milimetric) p:Normal>Low [18:14:57] Analytics: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#1998905 (Milimetric) [18:15:07] Analytics: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#1998907 (Milimetric) p:Triage>Low [18:18:39] Analytics: Pageview API: Better filtering of bot traffic on top enpoints - https://phabricator.wikimedia.org/T123442#1998927 (Milimetric) [18:18:41] Analytics: Pageview API should filter artificial traffic - https://phabricator.wikimedia.org/T125361#1998926 (Milimetric) [18:20:18] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case - https://phabricator.wikimedia.org/T125228#1998946 (Milimetric) [18:20:20] Analytics: Handle EventLogging's pykafka connection errors gracefully {oryx} - https://phabricator.wikimedia.org/T125207#1998945 (Milimetric) [18:21:00] Analytics, Editing-Analysis, Editing-Department: Consider scrapping Schema:PageContentSaveComplete and Schema:NewEditorEdit, given we have Schema:Edit - https://phabricator.wikimedia.org/T123958#1998956 
(Milimetric) a:Milimetric>None [18:21:44] Analytics, Analytics-Wikimetrics: Some special characters break Wikimetrics' encoding {dove} - https://phabricator.wikimedia.org/T114884#1998961 (Milimetric) p:Triage>Normal [18:22:18] Analytics, Analytics-Cluster: Remove refinery-hive.jar from hive-site.xml - https://phabricator.wikimedia.org/T114769#1998963 (Milimetric) p:Triage>Normal [18:24:01] Analytics: Spike: Can we have a production Event Logging endpoint from labs? - https://phabricator.wikimedia.org/T114503#1999002 (Milimetric) Open>Resolved a:Milimetric DONE YAYYAAY!! :) [18:26:04] Analytics: Visualization of Zika access by geography - https://phabricator.wikimedia.org/T125856#1999014 (Milimetric) NEW a:Milimetric [18:27:48] Analytics: Investigate US traffic by state normalized by population - https://phabricator.wikimedia.org/T114469#1999029 (Milimetric) p:Triage>Normal [18:28:36] Analytics, Documentation: Clean up and possibly refine UDP sampled logs (which go back to 2014) - https://phabricator.wikimedia.org/T114381#1999039 (Milimetric) p:Triage>Normal [18:29:18] Analytics: Notify potential users of the UDP sampled logs that we're preparing to purge them - https://phabricator.wikimedia.org/T114380#1999045 (Milimetric) [18:29:20] Analytics, Documentation: Clean up and possibly refine UDP sampled logs (which go back to 2014) - https://phabricator.wikimedia.org/T114381#1693705 (Milimetric) [18:32:11] holy spam :) sorry all - grooming is real [18:33:04] mforns: in case you have energy left today, I passed the download link instead of re-processing the data as text [18:33:16] so now it downloads ok but the filename is not pretty any more [18:33:44] which is the best I can do without changing the file itself, I think [18:53:08] a-team, I'm off for tonight ! [18:53:12] See you tomorrow :) [18:55:50] bbbbyyyeeeee [18:58:14] milimetric: Hi. Does pageview_hourly include pageviews for special pages too? 
[18:58:17] * MarkTraceur looks around for halfak [18:58:51] I had a question about the xmldatadumps, if anyone has any thoughts [18:59:27] bye joal! [19:00:29] The rough idea is that I want to import data from them, via a Python script and mysqlimport, into tables on stat1003. I'm doing it manually now. Then I want to run reportupdater on the data I've imported, and those queries are already in review (but I mentioned in the commit message that it should probably wait a bit)...is there any way to run a script after [19:00:29] each dump is done? Is there a way to run reportupdater scripts after each dump is imported? [19:00:59] Oh, and is there a way to get revision data for only those revisions not included in past dumps [19:01:41] halfak told me to use -pages-meta-history*.xml*.bz2 but maybe there's a different set of files that would serve me better [19:10:31] madhuvishy: I am going offline in a bit, tomorrow I'll try to take a look to the Go template for Burrow.. BUT if you want to do it today, let me know! [19:11:11] elukey: it should be easy enough - just adding a file to the puppet module - not sure i'll have time today - we can pair on it tomorrow morning if you want [19:14:58] sure! [19:15:29] Analytics-Tech-community-metrics, DevRel-February-2016: What is contributors.html for, in contrast to who_contributes_code.html and sc[m,r]-contributors.html and top-contributors.html? - https://phabricator.wikimedia.org/T118522#1999309 (Aklapper) @Lcanasdiaz: Do you think that it should be in upstream? :) [19:16:37] elukey: okay :) [19:35:25] milimetric, yt? [19:38:54] (CR) Mforns: [C: 2 V: 2] "LGTM!" 
[analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) (owner: Milimetric) [19:51:58] ottomata: does this look right to you to run the el processor [19:52:00] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs:2181&topic=el-test&auto_offset_reset=-2" "file:///home/madhuvishy/out.txt" [19:53:42] not quite [19:53:49] zookeeper_connect=conf100.analytics.eqiad.wmflabs:2181/kafka/... [19:53:50] something [19:54:00] whatever it says in kafka203 in /etc/kafka/server.properties [19:54:08] ah okay [19:57:17] ottomata: it was throwing NoPartitionsForConsumerException - which dint make any sense until you pointed out that zk url was wrong [20:47:13] ottomata: as far as i tried - eventlogging processor rebalances itself if out of a cluster of 3 - 1 or 2 kafka brokers go down - it dies when all three are down which makes sense [20:47:33] right [20:47:39] madhuvishy: that is what I saw too when I tested [20:47:45] madhuvishy: even if you take the first broker out of the list? [20:47:57] what happens if instead of stopping kafka, you shutdown the labs instance [20:47:58] or [20:48:06] use iptables to block all incoming connections on that port [20:48:11] block/ignore [20:48:16] ottomata: as in - pass only one broker to the url? 
[20:48:20] n [20:48:20] o [20:48:25] as in, pass 3, but stop the first one [20:48:33] the issue we saw in prod happened when kafka1012 died [20:48:37] and itis the first broker in the list [20:48:54] ya i stopped the first one [20:48:56] it was fine [20:49:16] let me try the other edge cases [20:56:05] bmansurov: yes [20:56:17] thanks [20:56:19] pageview_hourly includes pageviews for some special pages [20:56:34] ok [20:57:22] bmansurov: a really simple check is to look at the pageview API, which is fed from pageview_hourly, and as you can see: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Special:Search/daily/2015100100/2015103000 [20:57:54] milimetric: great, thanks for the tip [20:58:03] mforns: hey [20:58:32] hey milimetric I had a question, but found the answer myself, nm! [20:58:43] merged your patch [20:59:56] (CR) Milimetric: [C: 2 V: 2] "self-approving it's just a tiny simple change" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267910 (owner: Milimetric) [21:00:28] (CR) Milimetric: [C: 2 V: 2] "self-approving 'cause it's just a clean-up" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267389 (owner: Milimetric) [21:02:50] (those are all on a chain, once the tabular layout gets merged they'll all auto-merge) [21:03:08] madhuvishy: ja [21:03:12] i expect them to work [21:03:15] cause thats what I tested! 
[21:03:16] :) [21:03:18] but [21:03:21] i did not test stopping the whole box [21:03:29] i just tested stopping hte kafka broker process [21:03:30] lets try to break it a bit more :) [21:03:35] do the iptables thing [21:03:54] you want drop [21:04:07] http://www.cyberciti.biz/faq/iptables-block-port/ [21:04:28] iptables -A INPUT -p tcp --destination-port 9092 -j DROP [21:05:26] run that on the first broker while looking at EL logs [21:05:30] and then also maybe restart the EL process [21:12:14] ottomata: okay trying now [21:16:49] ottomata: some weirdnesss [21:17:42] ottomata: when i dropped the tables - the process dint do anything - just hung there for a while. So i killed and restarted the processor - and even if i produce to another broker in the cluster [21:17:55] it is hung on [21:18:02] https://www.irccloud.com/pastebin/2mIoZk7C/ [21:18:37] huH! [21:18:43] very similar to the mediawiki problem!!! :) [21:18:49] iintteresting [21:18:50] https://www.irccloud.com/pastebin/ZSSf3EUP/ [21:18:58] madhuvishy: can you find out what just a producer does in this case? [21:19:00] ottomata: now this happened [21:19:02] consume from stdin or osmething [21:19:08] madhuvishy: back, let me know if you want me to show you my changes for cron on wikimetrics instance so you know where things are [21:19:15] who [21:19:16] a [21:19:17] that's strange [21:19:20] zk on kafkaxxx? [21:19:24] is your url correct? [21:19:31] ottomata: i think so [21:19:53] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test&auto_offset_reset=-2" [21:19:53] "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test2" [21:20:12] ottomata: ^ [21:21:30] ya looks good [21:21:31] that is funky [21:21:37] why would it try to use the kafka brokers as zk connect? 
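(Editor's note: the `NoPartitionsForConsumerException` above came from a malformed `kafka://` reader URI — `zookeeper_connect` must include the ZooKeeper chroot path from the broker's `/etc/kafka/server.properties` (here `/kafka/analytics-analytics`), not just the host. A hypothetical helper that assembles the URI shape used in the pasted commands; this is not part of eventlogging itself, and the host names are from the labs test setup:)

```python
def kafka_reader_uri(brokers, zk_connect, topic, **params):
    """Assemble an eventlogging kafka:// reader URI like the ones pasted
    above. Extra query args (e.g. auto_offset_reset, socket_timeout_ms)
    are appended in order."""
    query = {"zookeeper_connect": zk_connect, "topic": topic}
    query.update(params)
    qs = "&".join("%s=%s" % kv for kv in query.items())
    return "kafka:///%s?%s" % (",".join(brokers), qs)
```

Listing all three brokers lets the consumer rebalance when one or two go down; the chroot in `zk_connect` is what was missing in the first, failing attempt.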
[21:22:03] madhuvishy: in the first case, when kafka203 timed out [21:22:12] ottomata: I dont know! it gave the same error when i killed all three brokers one by one [21:22:12] it did nothing during that time? [21:22:16] how long did that take? [21:22:20] oh ok [21:22:22] then that is probably fine [21:22:32] its ok if it prints that, as long as it keeps working [21:23:07] ottomata: in both cases - it died - but i meant at the end of killing all three brokers - not in between each step [21:23:22] it died now too - timed out and died [21:24:02] ah [21:24:10] ok confused [21:24:16] so, all 3 brokers up [21:24:18] things working [21:24:20] then [21:24:28] iptables -DROP [21:24:31] then what happens? [21:24:34] does it keep working? [21:24:40] iptables -DROP on just kafka203 [21:25:00] ottomata: no [21:25:05] it just hung there [21:25:20] then i killed and restarted, it gave the errors i pasted [21:25:22] and died [21:25:36] i restarted again now [21:25:44] it did [21:25:48] https://www.irccloud.com/pastebin/HXrX9vdB/ [21:26:21] i'm confused too [21:26:24] is it working when it does that? [21:26:35] when you restart it [21:26:36] ? [21:26:41] or does it hang for 30 seconds doing nothing? [21:26:48] does it start to work after 30 seconds? [21:26:57] looks like it, yes? 
[21:27:35] ottomata: hmmmm - i'm a bit confused as to what it's doing - we can batcave - or i can repeat what i did again to make sure [21:28:07] gimme a few mins then ja [21:30:38] ok madhuvishy les batcave [21:30:42] my stuff is not WORKGIINGNGN [21:30:43] GRRR [21:30:43] :) [21:31:20] ottomata: joining [21:40:24] https://www.irccloud.com/pastebin/Q8PfvkQs/ [21:40:42] Analytics: Make sunburst and stacked-bars resize with window {crow} [3 pts] - https://phabricator.wikimedia.org/T114162#2000056 (Milimetric) [21:41:07] Analytics: Make sunburst and stacked-bars resize with window {crow} [3 pts] - https://phabricator.wikimedia.org/T114162#2000062 (Milimetric) p:Triage>Normal [21:41:15] Analytics: Cassandra Backfill July [5 pts] {melc} - https://phabricator.wikimedia.org/T119863#2000063 (Milimetric) p:Triage>Normal [21:41:23] &socket_timeout_ms=1000 [21:58:37] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test&socket_timeout_ms=1000" stdout:// [21:58:45] ottomata: [22:02:55] Analytics-Kanban, RESTBase, Patch-For-Review: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#2000186 (mobrovac) We have just pushed out a change for #RESTBase (v0.10.0) and #service-runner (v1.1.1) which will require another small config change. I'll c... [22:05:36] (CR) Nuria: Functions for categorizing queries. (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) (owner: Bearloga) [22:06:32] * nuria_ secretly hoping those events appearing multiple times in beta labs have sorted themselves [22:13:49] milimetric: i'm guessing you plan to deploy soon given the config change upload? 
[22:14:08] mobrovac: yeah, i was just catching up on the changes y'all made [22:14:21] oh i forgot you're all in SF [22:14:29] I was waiting to tomorrow morning to ask you :) [22:14:35] you should probably add the utf-8 enconding to the headers [22:14:37] cf https://lists.wikimedia.org/pipermail/analytics/2016-February/004908.html [22:14:42] i saw that, yea [22:14:44] milimetric: hehe, yup, still here [22:15:07] cool, i'll finish something up and add those headers, but is there anything else we need? [22:15:17] is referencing the aqs_default.yaml config still ok? [22:16:46] yup [22:16:51] nice job milimetric! [22:18:22] Analytics-EventLogging, MediaWiki-extensions-MultimediaViewer: 60% of MultimediaViewerNetworkPerformance events dropped (exceeds maxUrlSize) - https://phabricator.wikimedia.org/T113364#2000254 (Tgr) It's temporarily broken due to T91465 (a little bit less temporarily than expected, due to all the train s... [22:20:58] madhuvishy: let's talk about burrow/cron when you are free, i am going to look at mysql on eventlogging [22:29:36] nuria_: dunno what's going on with burrow, but i think its fine [22:29:43] curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer/mysql-m4-master/status | jq . [22:30:02] ottomata: ya, i was looking arround in mysql and could not find obvious issues from graphana [22:30:15] ottomata: let me look at db to see timestamps of events being inserted now [22:32:34] nuria_: i think burrow is noticing a small amount of lag for some reason, and reporting, but it is intermittent [22:40:15] nuria_: ottomata ya - the lag if you notice has dropped from ~30 in the morning to ~17 now [22:40:29] the number of lagging partitions is also 2 now [22:40:33] gone down by 1 [22:40:45] madhuvishy: ...is the lag teh 3rd number reported ? [22:40:50] nuria_: yeah [22:41:02] k [22:41:23] ? [22:41:26] oh [22:41:27] yes [22:41:33] and i htink that is offsets, right? 
[22:41:39] which is messages, not many [22:41:46] ottomata: timestamp, offsets, lag [22:42:00] lag is in offsets though, ja? [22:42:17] like, 17 messages behind [22:42:18] ottomata: oh yes [22:42:20] ja [22:42:21] yes [22:42:59] ottomata: maybe if the mysql consumer is bound to be slow we could make it more permissive [22:43:58] madhuvishy: ya cause 17 messages on an inflow of 200 per sec seems quite reasonable [22:45:15] madhuvishy: i can make it more lenient, are those alarms in puppet? [22:45:40] nuria_: in puppet yes - but not "alarms" per se [22:45:47] https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules [22:47:52] nuria_: If the consumer offset does not change over the window, and the lag is either fixed or increasing, the consumer is in an ERROR state, and the partition is marked as STALLED. [22:47:58] this is what happened i think [22:48:15] (1454622892611, 199315032, 17) -> (1454622901672, 199315032, 17) [22:48:21] offset hasn't changed [22:48:24] over 10 minutes [22:48:25] madhuvishy: k, added docs here: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Burrow [22:48:30] that's our window [22:48:38] and lag is fixed [22:49:03] i'm not sure if nothing from that partition is being consumed yet [22:49:06] ottomata: ^ [22:50:54] ahhh [22:50:57] that makes so much sense [22:51:04] our batch time is what, 5 minutes? [22:51:18] hmmm [22:51:19] actually [22:51:21] no it doesn't [22:51:24] i take it back [22:51:25] ottomata: i think so - but we changed it? [22:51:29] because it's consuming from valid-mixed [22:51:36] was gonna suggest that there was just little data for a schema [22:51:40] ottomata: https://github.com/wikimedia/eventlogging/blob/master/README.md is really great, by the way [22:51:41] but, that's wrong [22:51:44] right [22:51:47] there is def data in valid-mixed [22:51:58] thanks!
[22:51:58] but not all partitions are erroring [22:52:07] ori also this is new: https://wikitech.wikimedia.org/wiki/EventBus [22:52:16] would some partitions not have data for some reason? [22:52:17] wip by eevans and I [22:52:59] i don't think we key messages like that, and if we did, i would think there's enough variation in keys to not have particular partition slow [22:53:08] ottomata: either that.. or... offsets are not being commited for those partitions [22:53:18] i dunno [22:53:48] right [22:55:06] or its just a little slow - and we should increase our lagcheck interval [22:55:22] mobrovac: small update for https://github.com/wikimedia/restbase/pull/501 [22:56:08] madhuvishy: no offsets have to be comitted or we would be getting repeated larms right? [22:56:11] *alarms [22:56:23] neat! [22:56:32] nuria_: yes - they are just being a bit slow for some reason i think [22:56:36] milimetric: will merge as soon as travis confirms [22:56:36] madhuvishy: cause the two error alarms are for different partitions [22:56:44] nuria_: yup [22:58:34] nuria_: The lagcheck intervals configuration determines the length of our sliding window. This specifies the number of offsets to store for each partition that a consumer group consumes. The default setting is 10 which, when combined with an offset commit interval (configured on the consumer) of 60 seconds, means we evaluate the status of a consumer group [22:58:34] over 10 minutes. [22:58:50] in our case, I believe the offset commit interval is 1 second [22:58:52] so [22:59:15] it evaluates every 10 seconds? [22:59:21] ottomata: ^ [23:00:13] which i guess is too less. we can have it do it every 1 minute or so may be? 
and increase the number of offsets to store [23:02:17] hmmmmm oh, interesting madhuvishy possibly [23:02:25] i'm not sure how burrow would know about how often pykafka is committing though [23:02:36] oh i guess it is just looking at the last 10 commits [23:02:37] ahhhh [23:02:38] yes [23:02:39] then yes [23:02:41] we should just increase [23:02:43] makes sense [23:02:49] right, last 10 offsets committed [23:02:58] ottomata: yup [23:03:00] okay [23:03:40] so we can bump it to 60 offsets and see if some of these false alarms go away. 10s is too short a window [23:03:58] i never made the connection that we were committing every second [23:04:03] to have 10 minutes should be 600 offsets [23:04:34] which is the example given on the docs [23:04:36] nuria_: yeah - that just seems really high though - i'm suggesting we go up to 1 minute [23:05:38] madhuvishy: what makes it seem high? [23:06:51] ottomata: that looks great [23:07:48] nuria_: i am not really sure - but - if you look at https://github.com/linkedin/Burrow/wiki/Configuration#lagcheck [23:08:00] there is also the min-distance param [23:08:36] madhuvishy: their default is 10 intervals of offset commits of 60 secs = 10 mins [23:08:51] and i wonder what would happen if we increased the min-distance [23:08:53] yes i know [23:09:26] i don't know if an interval of 600 is a good idea - it might be - it just seems like a really high number [23:10:02] madhuvishy: ok, let's do trial and error [23:10:18] yup which is why i thought we could step it up to a minute [23:10:26] madhuvishy: so now we have 10 secs, which is clearly too short as we have false alarms [23:10:26] and if we still get false alarms [23:10:33] we can increase [23:11:11] madhuvishy: let's do the old-fashioned tried and tested method => bump up 1 order of magnitude: 100 secs [23:11:21] madhuvishy: and see whether we get too few alarms or too many [23:11:40] sure why not [23:11:48] madhuvishy: so making intervals=100 should do it
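The change being agreed on, as it might look in the `[lagcheck]` section of Burrow's config (the `intervals` key name is per the Configuration wiki page linked above; this is an illustrative fragment, not copied from the actual puppet template):

```ini
[lagcheck]
; Number of committed offsets kept per partition. With ~1s offset commits
; from the EventLogging consumer, 100 gives a ~100s evaluation window
; instead of the default 10 offsets (~10s), which was causing false alarms.
intervals=100
```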
[23:11:50] hello! who could I talk to about the new pageview API demo? https://analytics.wmflabs.org/demo/pageview-api/ [23:12:01] MusikAnimal: anyone here really [23:12:04] it is very nice, but does not support deeplinking :( [23:12:10] i'm out yallllLlLll laters! [23:12:16] MusikAnimal: no, cause it is a showcase [23:12:21] MusikAnimal: not a tool [23:12:23] ottomata: byeee [23:12:36] MusikAnimal: it is meant as a way for people to cut and paste code and get started with the api [23:12:44] MusikAnimal: fancier tools are coming [23:12:47] yes, the code. Where is it? ha [23:13:37] James_F informed me there is an official tool coming [23:13:41] so I guess I should just wait? [23:13:49] nuria_: it is here - https://github.com/wikimedia/operations-puppet/blob/production/modules/burrow/templates/burrow.cfg.erb. I think we should make the interval configurable though. Do you want to do it? if not I can [23:14:34] madhuvishy: no, it's my ops week, i will do it now [23:14:42] For context, https://meta.wikimedia.org/wiki/Community_Tech/Pageview_stats_tool is being built/supported by Community Tech. [23:14:54] nuria_: okay let me know if you need anything :) [23:15:07] MusikAnimal: right, a tool is being built by community folks [23:15:09] nuria_: as for the cron - I'm sure you'll take care of it - no worries! [23:15:50] madhuvishy: haha, but you want to know how you can bypass the fact that it should be root running that cron (as the depot is checked out with sudo) [23:16:00] madhuvishy: not exactly obvious [23:16:09] aha [23:16:23] how does that work?
[23:16:46] MusikAnimal: https://gist.github.com/marcelrf/49738d14116fd547fe6d#file-article-comparison-html i think this is it [23:16:57] (code for demo) [23:17:11] madhuvishy: you add to the sudoers the ability for some users (that already have sudo) [23:17:23] madhuvishy: to execute commands w/o being prompted for a password [23:20:07] nuria_: aah [23:25:41] like: [23:25:56] nuria ALL=NOPASSWD: somecommand [23:26:18] in /etc/sudoers.d/ - will add this to the ticket [23:26:30] cc madhuvishy [23:26:37] nuria_: mm hmm okay [23:47:24] madhuvishy: did you have time this morning to look at last access numbers with joseph? [23:49:50] nuria_: no.. We never got to it [23:50:04] madhuvishy: k, i am going to take a look out of curiosity
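The demo code in the gist above pulls its numbers from the Pageview API being showcased. A minimal sketch of the kind of request it makes, against the public per-article REST endpoint (the URL layout follows the publicly documented route; treat the exact path segments and parameter defaults as assumptions rather than a reading of the demo's own code, and note that article titles must be URL-encoded):

```python
from urllib.parse import quote

# Hypothetical helper building a per-article pageviews URL for the
# wikimedia.org REST API; segment order is an assumption based on the
# public documentation, not taken from the demo.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageview_url(project, article, start, end,
                 access="all-access", agent="user", granularity="daily"):
    # quote(..., safe="") also escapes "/" so slashes in titles survive.
    return "/".join([BASE, project, access, agent,
                     quote(article, safe=""), granularity, start, end])

url = pageview_url("en.wikipedia.org", "Main Page", "20160101", "20160131")
print(url)
```

Fetching the URL (e.g. with `urllib.request.urlopen`) returns JSON with one entry per day in the requested range.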