[00:53:41] (PS7) Bearloga: Functions for categorizing queries. [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) [10:03:19] * elukey is puzzled about smartctl on kafka1012 [11:26:42] I updated https://phabricator.wikimedia.org/T125199 with some findings, really weird [12:18:35] hey a-team! [12:18:48] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997521 (Aklapper) NEW [12:19:14] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997527 (Aklapper) >>! In T64221#1785541, @Aklapper wrote: >>>! In T64221#1781654, @Qgil wrote: >> These two empty se... [12:19:50] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997521 (Aklapper) [12:19:52] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997528 (Aklapper) [12:20:46] mforns o/ [12:22:22] hi elukey how is it going? [12:23:06] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#1997531 (Aklapper) Hmm. Reloading, it now works as expected again and sorted correctly. Might be some race condition or such. [12:29:11] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#1997545 (Aklapper) I think we can close this task as "good enough" once 1) **Input welcome:** we have decided whethe... [12:29:53] hey mforns :) [12:43:24] joal, hi! 
[12:50:46] mforns: goooood! I am still investigating the weird disk issue in kafka1012, something is still not clear to me [12:50:49] joal: o/ [12:51:00] \o elukey [14:46:49] good morning a-team! [14:46:54] joal: how dem dar jobs lookin? [14:46:57] hi ottomata :] [14:47:03] hey ottomata [14:47:25] Cluster still very busy, uniques backfilling is for something [14:48:02] hm, aye k, the data move finished? [14:48:03] And some appSessionMetrics failed, but I think it's because of resource constraints, I'll relaunch them when it'll be quieter [14:48:07] k [14:48:20] it finished yesterday night, and I started everything [14:48:36] Only breakage so far is AppsSessionMetricsGlobal [14:48:54] And uniques monthly still running ( but it could have been expected) [14:51:11] K COOL [14:51:13] WHOA CAPS [14:53:18] ottomata: Since it's still babysitting day, let's go for the cassandra one [14:53:26] hey yall, moritzm wants to reboot eventlog1001 [14:53:32] i'm going to stop eventlogging and let him do so [14:53:41] k [14:53:42] s'ok a-team? [14:53:45] mforns: ok? [14:54:01] ok ottomata [14:54:18] sok with me [14:57:28] !log stopping eventlogging to reboot eventlog1001 for kernel update [14:59:35] joal: cassandra one? [14:59:40] yessir [15:00:00] Ummmm not remembering [15:00:12] resetbase, from gwicke [15:00:20] oH [15:00:21] right [15:00:22] haha [15:00:24] was thikning oozie job [15:00:25] thanks! [15:00:26] yes [15:04:16] joal: applied [15:04:19] can you restart restbase? [15:04:30] k ottomata, then we need to wait for a puppet run, correct ? [15:05:01] already done [15:05:08] great ottomata [15:05:24] Then, only thing is to restart restbase I think [15:05:24] hm one sec [15:05:27] i think 1001 doesn't have it applied [15:05:31] k [15:05:37] cmoon puppet [15:05:50] thar we go [15:05:53] ok good to restart [15:06:09] ottomata: one by one please :) [15:06:18] And, please wait a minute [15:06:22] ottomata: --^ [15:06:44] ottomata: GO ! 
[15:06:50] uhhh [15:06:55] do i just do service restbase restart? [15:07:07] hm, I think so [15:07:08] joal: ? [15:07:10] haha [15:07:10] k [15:07:12] milimetric: confirm? [15:07:18] I don't know if cassandra needs a restart [15:07:35] or if the read local applies in restbase only [15:09:39] ha, me neither! [15:09:44] would like to have some more info [15:09:49] ottomata: currently reading on the topic [15:09:50] gwicke: only mentioned restbase restart yesterday [15:09:55] hm, i will ask urandom [15:10:09] then restbase restart, that is good (and what I'd have expected) [15:12:34] joal: I've been reading up I'm not seeing how this relates to restbase [15:12:35] ottomata: restbase restart is sure, my wonder is about cassandra [15:12:42] Analytics-EventLogging, DBA: Potentially decrease db1046's InnoDB buffer pool - https://phabricator.wikimedia.org/T125829#1998057 (jcrespo) NEW a:jcrespo [15:13:10] joal: urandom confirms, doing restart one at a time [15:13:22] awesome :) [15:13:24] no idea what you guys are mumbling about [15:13:29] :) [15:13:39] !log restarting aqs restbase 1 node at a time [15:13:45] Lowering read consistency level of restbase on cassandra milimetric [15:14:05] looks ok [15:14:10] how long should i wait before I do the next one? [15:14:11] * milimetric is even more lost now [15:14:15] huhu [15:14:25] * milimetric feels like he missed some IRC messages [15:14:36] ottomata: a few seconds should be ok I think [15:15:01] ottomata: currently monitoring latency charts [15:15:55] Charts are not really up-to-date, so difficult to follow :) [15:17:13] k just restarted 1002 [15:17:21] looks ok [15:18:07] ottomata: it's quiet time for them (just disk seek), so should be ok [15:19:01] ok, just restarted on 1003 [15:19:08] also looks fine [15:19:09] cool [15:19:15] great [15:23:43] oh by the way ottomata : I tested hive MSCK REPAIR yesterday: works like a charm to restore partitions in metastore from folder hierarchy !
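(Editor's note: Hive's `MSCK REPAIR TABLE` works by scanning the table's directory tree for Hive-style `key=value` partition folders and registering any that are missing from the metastore — which is why it only works "as long as it has the partition keys in the folder names". A rough, illustrative Python sketch of just the discovery step; the directory layout is hypothetical, and this is not Hive's actual implementation:)

```python
import os

def discover_partitions(table_root):
    """Walk a Hive table directory and collect key=value partition specs,
    roughly what MSCK REPAIR TABLE does before adding them to the metastore."""
    partitions = []
    for dirpath, dirnames, _filenames in os.walk(table_root):
        rel = os.path.relpath(dirpath, table_root)
        if rel == ".":
            continue
        parts = rel.split(os.sep)
        # Only leaf directories where every path component looks like key=value
        if not dirnames and all("=" in p for p in parts):
            partitions.append(dict(p.split("=", 1) for p in parts))
    return partitions
```

So after moving files into `.../year=2016/month=02/day=04/` under the table location, a single `MSCK REPAIR TABLE` restores all the partitions at once instead of one `ALTER TABLE ... ADD PARTITION` per folder.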
[15:25:40] milimetric, is there any existing dashiki config wiki page that uses the textual format? I'm reviewing your patch [15:25:42] awesome! for what table? [15:25:50] as long as it has the partition keys in the folder names, ja? [15:25:55] ottomata: mmmm so what do we need to do to restart EL hosts? anthing that might go in https://wikitech.wikimedia.org/wiki/Service_restarts ? [15:26:00] mforns: yes, I just changed Config:TestTabs to have it [15:26:06] cool thx! [15:26:17] aqs hourly ottomata : I had a table in my db, therefore created one in wmf db, moved the files, and MSCK REPAIR --> done in minutes ! [15:26:25] mforns: also though the default config has it, if you navigate to src/layouts/tabs/ [15:26:37] I see milimetric [15:26:45] elukey: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Restart_EventLogging [15:26:47] mforns: the build part of this layout is broken still, it's very painful [15:26:51] although, i'm going to edit that [15:26:53] i don't think restart works [15:27:09] edited. [15:27:11] stop && start [15:27:11] milimetric, it worked for me... [15:27:21] it doesn't copy the new fonts from semantic2 [15:27:30] I see [15:27:33] I was trying to fix it last night and I'm still trying [15:27:48] elukey: if you edit service_restarts, maybe just link there, because it is probably going to change if we upgrade to jessie and use systemd [15:28:02] haha [15:28:04] its here twice already! [15:28:05] https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Restart_EventLogging_2 [15:28:18] ha, did I add that? [15:28:44] I consolidated the page a while ago from some snippets found here and there [15:28:52] nice [15:28:53] I might have done it without knowing it [15:29:19] I am always scared when we talk about stopping EL [15:29:34] is there any risk of loosing data or is it completely fine? 
[15:29:40] just to know what NOT to do [15:30:35] ja, there could be a little loss [15:30:42] but mostly not [15:30:57] usually queued messages will be inserted into mysql before it fully stops [15:30:59] (PS3) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [15:31:07] and the next time it starts, it picks up from previous offsets [15:31:08] mforns: ok, new patch up, all works now [15:31:12] which causes more duplicates than loss [15:31:24] * mforns looks [15:33:24] mforns: if you set the visualizer type of the Sessions metric to table-timeseries you'll see it handles that data being HUGE by setting the filter to apply only when you press enter [15:34:04] milimetric, aha [15:38:35] Analytics-Tech-community-metrics, pywikibot-core, DevRel-February-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1998117 (Lcanasdiaz) Half of the work is done. I'm going to work on the visualization of t... [15:40:24] ahhh yes the batch inserts [15:48:01] (PS4) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [15:48:14] mforns: I forgot I had made some modifications to the timeseries data core, so I added some tests ^ [15:48:28] milimetric, awesome [16:03:12] Analytics-Kanban, RESTBase: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#1998202 (Milimetric) [16:03:43] joal: whenever you have time can we chat about https://phabricator.wikimedia.org/T125199 ? [16:04:04] (brb) [16:08:10] Hey elukey [16:08:34] I'm ignorant in SMART (which makes sense) [16:08:48] We can chat if you want but I don't think I can help [16:09:16] yessss I just wanted to talk about the next steps [16:09:26] do you have a minute in the batcave? 
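(Editor's note: the "more duplicates than loss" behavior described above is standard at-least-once consumption: offsets are committed only after a batch is inserted, so a crash between insert and commit replays that batch on restart. A toy model, not the actual eventlogging mysql consumer code — names and batch size are illustrative:)

```python
def consume(messages, committed_offset, batch_size=3):
    """Consume from the last committed offset, 'inserting' each batch and
    only then advancing the offset. A crash after insert but before commit
    means the batch is replayed on restart -> duplicates, not loss."""
    inserted = []
    offset = committed_offset
    while offset < len(messages):
        batch = messages[offset:offset + batch_size]
        inserted.extend(batch)   # e.g. MySQL batch insert
        offset += len(batch)     # offset commit happens after the insert
    return inserted, offset
```

For example, if a run inserted everything but died before committing past offset 3, the restarted consumer re-reads from 3 and re-inserts those rows, which is why duplicates show up downstream rather than gaps.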
[16:09:31] sure [16:19:20] Analytics-Tech-community-metrics, pywikibot-core, DevRel-February-2016, Upstream: Statistics for SCM project 'core' mix pywikibot/core, mediawiki/core and oojs/core - https://phabricator.wikimedia.org/T123808#1998274 (Lcanasdiaz) Dropdown widget fixed. Ready to be deployed. https://github.com/VizG... [16:20:56] (CR) Mforns: [C: -1] "Everything LGTM, but somehow I can not download the larger datasets. See comments. If you think this is not too important, I'm OK with mer" (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) (owner: Milimetric) [16:32:55] Analytics-Kanban: Rotate kafka GC logs [3 pts] {hawk} - https://phabricator.wikimedia.org/T124644#1998318 (Nuria) Open>Resolved [16:33:11] Analytics-Kanban: Debug wikimetrics docker dev setup failing on ubuntu 14.04 - https://phabricator.wikimedia.org/T125415#1998320 (Nuria) Open>Resolved [16:34:31] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Queries for the edit analysis dashboard failing since December 2015 [5 pts] - https://phabricator.wikimedia.org/T124383#1998324 (Nuria) Open>Resolved [16:35:08] Analytics-Kanban, Editing-Analysis: Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#1998326 (Nuria) Also, please see: https://phabricator.wikimedia.org/T124383 When issues with size are fixed these jobs can... 
[16:36:26] Analytics-Kanban, DBA, Patch-For-Review: Pending maintenance on the eventlogging databases (db1046, db1047, dbstore1002, other dbstores) - https://phabricator.wikimedia.org/T120187#1998327 (Nuria) Open>Resolved [16:37:40] Analytics-Kanban, Patch-For-Review: Create a dedicated hive table with pageview API only requests for reporting [5 pts] {melc} - https://phabricator.wikimedia.org/T118938#1998329 (Nuria) Open>Resolved [16:38:06] Analytics-Kanban, Patch-For-Review: Create a dedicated hive table with pageview API only requests for reporting [5 pts] {melc} - https://phabricator.wikimedia.org/T118938#1813225 (Nuria) Resolved>Open [16:56:59] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, Patch-For-Review: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#1998395 (Sadads) @Legoktm: did the analysis help? I think adding the feature desirable but not... [17:00:34] a-team joining standup in a second. [17:00:36] a-team: standduppp [17:01:04] coming! [17:01:43] ottomata: holaaaa [17:02:42] EEE [17:24:09] (PS5) Milimetric: Add textual table "visualization" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) [18:02:19] Analytics-Kanban: Remove cron on wikimetrics instance that updates vital signs [1 pts] - https://phabricator.wikimedia.org/T125751#1998780 (Milimetric) [18:03:16] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Create and maintain an Analytics Cluster in Beta Cluster in labs. [21 pts] - https://phabricator.wikimedia.org/T109859#1998782 (Ottomata) [18:03:50] Analytics-Cluster, Analytics-Kanban, EventBus, Patch-For-Review: Camus job to import mediawiki.* eventbus data to Hadoop. 
[8 pts] - https://phabricator.wikimedia.org/T125144#1998783 (Ottomata) [18:04:37] Analytics, Analytics-Cluster, hardware-requests, operations: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1998793 (Milimetric) [18:05:15] Analytics, hardware-requests, operations, Patch-For-Review: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1998797 (Milimetric) [18:06:23] Analytics-Kanban: Bookmark-able graphs in Dashiki tabular layout [3 pts] {lama} - https://phabricator.wikimedia.org/T124298#1998807 (Milimetric) [18:07:00] Analytics: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1998809 (Milimetric) [18:07:17] Analytics: analytics specific icinga alerts should ping in our IRC channel. - https://phabricator.wikimedia.org/T125128#1998811 (Milimetric) p:Triage>Normal [18:07:43] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer {oryx} - https://phabricator.wikimedia.org/T125225#1998813 (Milimetric) [18:08:20] Analytics: Add regexps that match the bots that follow the User-Agent policy {hawk} - https://phabricator.wikimedia.org/T125731#1998819 (Milimetric) p:Triage>Normal [18:08:43] Analytics: Add regexps that match the bots that follow the User-Agent policy {hawk} - https://phabricator.wikimedia.org/T125731#1996003 (Milimetric) research whether the new expressions that we're adding have a match on the cluster [18:09:34] Analytics, Analytics-Cluster, operations: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998824 (Milimetric) Open>stalled No actionables. 
If we install we'll use RAID [18:09:59] Analytics, Analytics-Cluster, operations: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1998826 (Milimetric) stalled>Resolved a:Milimetric [18:10:50] Analytics: Data integrity on Analytics Kafka nodes - https://phabricator.wikimedia.org/T125650#1998840 (Milimetric) Open>Resolved Kind of the same as T99105, if we move to RAID that'll probably solve this [18:11:16] Analytics-Kanban, Wikipedia-Android-App: Database not updated for beta event logging and all-events.log reports 8x for each event [3 pts] - https://phabricator.wikimedia.org/T125423#1998848 (Milimetric) [18:11:27] Analytics: Upgrade Dashiki to semantic-2 for all layouts - https://phabricator.wikimedia.org/T125409#1998851 (Milimetric) [18:11:44] Analytics: Upgrade Dashiki to semantic-2 for all layouts - https://phabricator.wikimedia.org/T125409#1986826 (Milimetric) p:Normal>Low [18:14:57] Analytics: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#1998905 (Milimetric) [18:15:07] Analytics: Clean up datasets.wikimedia.org - https://phabricator.wikimedia.org/T125854#1998907 (Milimetric) p:Triage>Low [18:18:39] Analytics: Pageview API: Better filtering of bot traffic on top enpoints - https://phabricator.wikimedia.org/T123442#1998927 (Milimetric) [18:18:41] Analytics: Pageview API should filter artificial traffic - https://phabricator.wikimedia.org/T125361#1998926 (Milimetric) [18:20:18] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case - https://phabricator.wikimedia.org/T125228#1998946 (Milimetric) [18:20:20] Analytics: Handle EventLogging's pykafka connection errors gracefully {oryx} - https://phabricator.wikimedia.org/T125207#1998945 (Milimetric) [18:21:00] Analytics, Editing-Analysis, Editing-Department: Consider scrapping Schema:PageContentSaveComplete and Schema:NewEditorEdit, given we have Schema:Edit - https://phabricator.wikimedia.org/T123958#1998956 
(Milimetric) a:Milimetric>None [18:21:44] Analytics, Analytics-Wikimetrics: Some special characters break Wikimetrics' encoding {dove} - https://phabricator.wikimedia.org/T114884#1998961 (Milimetric) p:Triage>Normal [18:22:18] Analytics, Analytics-Cluster: Remove refinery-hive.jar from hive-site.xml - https://phabricator.wikimedia.org/T114769#1998963 (Milimetric) p:Triage>Normal [18:24:01] Analytics: Spike: Can we have a production Event Logging endpoint from labs? - https://phabricator.wikimedia.org/T114503#1999002 (Milimetric) Open>Resolved a:Milimetric DONE YAYYAAY!! :) [18:26:04] Analytics: Visualization of Zika access by geography - https://phabricator.wikimedia.org/T125856#1999014 (Milimetric) NEW a:Milimetric [18:27:48] Analytics: Investigate US traffic by state normalized by population - https://phabricator.wikimedia.org/T114469#1999029 (Milimetric) p:Triage>Normal [18:28:36] Analytics, Documentation: Clean up and possibly refine UDP sampled logs (which go back to 2014) - https://phabricator.wikimedia.org/T114381#1999039 (Milimetric) p:Triage>Normal [18:29:18] Analytics: Notify potential users of the UDP sampled logs that we're preparing to purge them - https://phabricator.wikimedia.org/T114380#1999045 (Milimetric) [18:29:20] Analytics, Documentation: Clean up and possibly refine UDP sampled logs (which go back to 2014) - https://phabricator.wikimedia.org/T114381#1693705 (Milimetric) [18:32:11] holy spam :) sorry all - grooming is real [18:33:04] mforns: in case you have energy left today, I passed the download link instead of re-processing the data as text [18:33:16] so now it downloads ok but the filename is not pretty any more [18:33:44] which is the best I can do without changing the file itself, I think [18:53:08] a-team, I'm off for tonight ! [18:53:12] See you tomorrow :) [18:55:50] bbbbyyyeeeee [18:58:14] milimetric: Hi. Does pageview_hourly include pageviews for special pages too? 
[18:58:17] * MarkTraceur looks around for halfak [18:58:51] I had a question about the xmldatadumps, if anyone has any thoughts [18:59:27] bye joal! [19:00:29] The rough idea is that I want to import data from them, via a Python script and mysqlimport, into tables on stat1003. I'm doing it manually now. Then I want to run reportupdater on the data I've imported, and those queries are already in review (but I mentioned in the commit message that it should probably wait a bit)...is there any way to run a script after [19:00:29] each dump is done? Is there a way to run reportupdater scripts after each dump is imported? [19:00:59] Oh, and is there a way to get revision data for only those revisions not included in past dumps [19:01:41] halfak told me to use -pages-meta-history*.xml*.bz2 but maybe there's a different set of files that would serve me better [19:10:31] madhuvishy: I am going offline in a bit, tomorrow I'll try to take a look to the Go template for Burrow.. BUT if you want to do it today, let me know! [19:11:11] elukey: it should be easy enough - just adding a file to the puppet module - not sure i'll have time today - we can pair on it tomorrow morning if you want [19:14:58] sure! [19:15:29] Analytics-Tech-community-metrics, DevRel-February-2016: What is contributors.html for, in contrast to who_contributes_code.html and sc[m,r]-contributors.html and top-contributors.html? - https://phabricator.wikimedia.org/T118522#1999309 (Aklapper) @Lcanasdiaz: Do you think that it should be in upstream? :) [19:16:37] elukey: okay :) [19:35:25] milimetric, yt? [19:38:54] (CR) Mforns: [C: 2 V: 2] "LGTM!" 
[analytics/dashiki] - https://gerrit.wikimedia.org/r/267993 (https://phabricator.wikimedia.org/T124297) (owner: Milimetric) [19:51:58] ottomata: does this look right to you to run the el processor [19:52:00] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs:2181&topic=el-test&auto_offset_reset=-2" "file:///home/madhuvishy/out.txt" [19:53:42] not quite [19:53:49] zookeeper_connect=conf100.analytics.eqiad.wmflabs:2181/kafka/... [19:53:50] something [19:54:00] whatever it says in kafka203 in /etc/kafka/server.properties [19:54:08] ah okay [19:57:17] ottomata: it was throwing NoPartitionsForConsumerException - which dint make any sense until you pointed out that zk url was wrong [20:47:13] ottomata: as far as i tried - eventlogging processor rebalances itself if out of a cluster of 3 - 1 or 2 kafka brokers go down - it dies when all three are down which makes sense [20:47:33] right [20:47:39] madhuvishy: that is what I saw too when I tested [20:47:45] madhuvishy: even if you take the first broker out of the list? [20:47:57] what happens if instead of stopping kafka, you shutdown the labs instance [20:47:58] or [20:48:06] use iptables to block all incoming connections on that port [20:48:11] block/ignore [20:48:16] ottomata: as in - pass only one broker to the url? 
[20:48:20] n [20:48:20] o [20:48:25] as in, pass 3, but stop the first one [20:48:33] the issue we saw in prod happened when kafka1012 died [20:48:37] and itis the first broker in the list [20:48:54] ya i stopped the first one [20:48:56] it was fine [20:49:16] let me try the other edge cases [20:56:05] bmansurov: yes [20:56:17] thanks [20:56:19] pageview_hourly includes pageviews for some special pages [20:56:34] ok [20:57:22] bmansurov: a really simple check is to look at the pageview API, which is fed from pageview_hourly, and as you can see: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Special:Search/daily/2015100100/2015103000 [20:57:54] milimetric: great, thanks for the tip [20:58:03] mforns: hey [20:58:32] hey milimetric I had a question, but found the answer myself, nm! [20:58:43] merged your patch [20:59:56] (CR) Milimetric: [C: 2 V: 2] "self-approving it's just a tiny simple change" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267910 (owner: Milimetric) [21:00:28] (CR) Milimetric: [C: 2 V: 2] "self-approving 'cause it's just a clean-up" [analytics/dashiki] - https://gerrit.wikimedia.org/r/267389 (owner: Milimetric) [21:02:50] (those are all on a chain, once the tabular layout gets merged they'll all auto-merge) [21:03:08] madhuvishy: ja [21:03:12] i expect them to work [21:03:15] cause thats what I tested! 
[21:03:16] :) [21:03:18] but [21:03:21] i did not test stopping the whole box [21:03:29] i just tested stopping hte kafka broker process [21:03:30] lets try to break it a bit more :) [21:03:35] do the iptables thing [21:03:54] you want drop [21:04:07] http://www.cyberciti.biz/faq/iptables-block-port/ [21:04:28] iptables -A INPUT -p tcp --destination-port 9092 -j DROP [21:05:26] run that on the first broker while looking at EL logs [21:05:30] and then also maybe restart the EL process [21:12:14] ottomata: okay trying now [21:16:49] ottomata: some weirdnesss [21:17:42] ottomata: when i dropped the tables - the process dint do anything - just hung there for a while. So i killed and restarted the processor - and even if i produce to another broker in the cluster [21:17:55] it is hung on [21:18:02] https://www.irccloud.com/pastebin/2mIoZk7C/ [21:18:37] huH! [21:18:43] very similar to the mediawiki problem!!! :) [21:18:49] iintteresting [21:18:50] https://www.irccloud.com/pastebin/ZSSf3EUP/ [21:18:58] madhuvishy: can you find out what just a producer does in this case? [21:19:00] ottomata: now this happened [21:19:02] consume from stdin or osmething [21:19:08] madhuvishy: back, let me know if you want me to show you my changes for cron on wikimetrics instance so you know where things are [21:19:15] who [21:19:16] a [21:19:17] that's strange [21:19:20] zk on kafkaxxx? [21:19:24] is your url correct? [21:19:31] ottomata: i think so [21:19:53] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test&auto_offset_reset=-2" [21:19:53] "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test2" [21:20:12] ottomata: ^ [21:21:30] ya looks good [21:21:31] that is funky [21:21:37] why would it try to use the kafka brokers as zk connect? 
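(Editor's note: the `NoPartitionsForConsumerException` above came from a malformed `kafka://` reader URI — `zookeeper_connect` must include the ZooKeeper chroot path from the broker's `/etc/kafka/server.properties` (here `/kafka/analytics-analytics`), not just the host. A hypothetical helper that assembles the URI shape used in the pasted commands; this is not part of eventlogging itself, and the host names are from the labs test setup:)

```python
def kafka_reader_uri(brokers, zk_connect, topic, **params):
    """Assemble an eventlogging kafka:// reader URI like the ones pasted
    above. Extra query args (e.g. auto_offset_reset, socket_timeout_ms)
    are appended in order."""
    query = {"zookeeper_connect": zk_connect, "topic": topic}
    query.update(params)
    qs = "&".join("%s=%s" % kv for kv in query.items())
    return "kafka:///%s?%s" % (",".join(brokers), qs)
```

Listing all three brokers lets the consumer rebalance when one or two go down; the chroot in `zk_connect` is what was missing in the first, failing attempt.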
[21:22:03] madhuvishy: in the first case, when kafka203 timed out [21:22:12] ottomata: I dont know! it gave the same error when i killed all three brokers one by one [21:22:12] it did nothing during that time? [21:22:16] how long did that take? [21:22:20] oh ok [21:22:22] then that is probably fine [21:22:32] its ok if it prints that, as long as it keeps working [21:23:07] ottomata: in both cases - it died - but i meant at the end of killing all three brokers - not in between each step [21:23:22] it died now too - timed out and died [21:24:02] ah [21:24:10] ok confused [21:24:16] so, all 3 brokers up [21:24:18] things working [21:24:20] then [21:24:28] iptables -DROP [21:24:31] then what happens? [21:24:34] does it keep working? [21:24:40] iptables -DROP on just kafka203 [21:25:00] ottomata: no [21:25:05] it just hung there [21:25:20] then i killed and restarted, it gave the errors i pasted [21:25:22] and died [21:25:36] i restarted again now [21:25:44] it did [21:25:48] https://www.irccloud.com/pastebin/HXrX9vdB/ [21:26:21] i'm confused too [21:26:24] is it working when it does that? [21:26:35] when you restart it [21:26:36] ? [21:26:41] or does it hang for 30 seconds doing nothing? [21:26:48] does it start to work after 30 seconds? [21:26:57] looks like it, yes? 
[21:27:35] ottomata: hmmmm - i'm a bit confused as to what it's doing - we can batcave - or i can repeat what i did again to make sure [21:28:07] gimme a few mins then ja [21:30:38] ok madhuvishy les batcave [21:30:42] my stuff is not WORKGIINGNGN [21:30:43] GRRR [21:30:43] :) [21:31:20] ottomata: joining [21:40:24] https://www.irccloud.com/pastebin/Q8PfvkQs/ [21:40:42] Analytics: Make sunburst and stacked-bars resize with window {crow} [3 pts] - https://phabricator.wikimedia.org/T114162#2000056 (Milimetric) [21:41:07] Analytics: Make sunburst and stacked-bars resize with window {crow} [3 pts] - https://phabricator.wikimedia.org/T114162#2000062 (Milimetric) p:Triage>Normal [21:41:15] Analytics: Cassandra Backfill July [5 pts] {melc} - https://phabricator.wikimedia.org/T119863#2000063 (Milimetric) p:Triage>Normal [21:41:23] &socket_timeout_ms=1000 [21:58:37] bin/eventlogging-processor "%q %{recvFrom}s %{seqId}d %t %h %{userAgent}i" "kafka:///kafka203:9092,kafka200:9092,kafka201:9092?zookeeper_connect=conf100.analytics.eqiad.wmflabs/kafka/analytics-analytics&topic=el-test&socket_timeout_ms=1000" stdout:// [21:58:45] ottomata: [22:02:55] Analytics-Kanban, RESTBase, Patch-For-Review: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#2000186 (mobrovac) We have just pushed out a change for #RESTBase (v0.10.0) and #service-runner (v1.1.1) which will require another small config change. I'll c... [22:05:36] (CR) Nuria: Functions for categorizing queries. (9 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/254461 (https://phabricator.wikimedia.org/T118218) (owner: Bearloga) [22:06:32] * nuria_ secretly hoping those events appearing multiple times in beta labs have sorted themselves [22:13:49] milimetric: i'm guessing you plan to deploy soon given the config change upload? 
[22:14:08] mobrovac: yeah, i was just catching up on the changes y'all made [22:14:21] oh i forgot you're all in SF [22:14:29] I was waiting to tomorrow morning to ask you :) [22:14:35] you should probably add the utf-8 enconding to the headers [22:14:37] cf https://lists.wikimedia.org/pipermail/analytics/2016-February/004908.html [22:14:42] i saw that, yea [22:14:44] milimetric: hehe, yup, still here [22:15:07] cool, i'll finish something up and add those headers, but is there anything else we need? [22:15:17] is referencing the aqs_default.yaml config still ok? [22:16:46] yup [22:16:51] nice job milimetric! [22:18:22] Analytics-EventLogging, MediaWiki-extensions-MultimediaViewer: 60% of MultimediaViewerNetworkPerformance events dropped (exceeds maxUrlSize) - https://phabricator.wikimedia.org/T113364#2000254 (Tgr) It's temporarily broken due to T91465 (a little bit less temporarily than expected, due to all the train s... [22:20:58] madhuvishy: let's talk about burrow/cron when you are free, i am going to look at mysql on eventlogging [22:29:36] nuria_: dunno what's going on with burrow, but i think its fine [22:29:43] curl http://krypton.eqiad.wmnet:8000/v2/kafka/eqiad/consumer/mysql-m4-master/status | jq . [22:30:02] ottomata: ya, i was looking arround in mysql and could not find obvious issues from graphana [22:30:15] ottomata: let me look at db to see timestamps of events being inserted now [22:32:34] nuria_: i think burrow is noticing a small amount of lag for some reason, and reporting, but it is intermittent [22:40:15] nuria_: ottomata ya - the lag if you notice has dropped from ~30 in the morning to ~17 now [22:40:29] the number of lagging partitions is also 2 now [22:40:33] gone down by 1 [22:40:45] madhuvishy: ...is the lag teh 3rd number reported ? [22:40:50] nuria_: yeah [22:41:02] k [22:41:23] ? [22:41:26] oh [22:41:27] yes [22:41:33] and i htink that is offsets, right? 
[22:41:39] which is messages, not many [22:41:46] ottomata: timestamp, offsets, lag [22:42:00] lag is in offsets though, ja? [22:42:17] like, 17 messages behind [22:42:18] ottomata: oh yes [22:42:20] ja [22:42:21] yes [22:42:59] ottomata: maybe if the mysql consumer is bound to be slow we could make it more permissive [22:43:58] madhuvishy: ya cause 17 messages on an inflow of 200 per sec seems quite reasonable [22:45:15] madhuvishy: i can make it more lenient, are those alarms in puppet? [22:45:40] nuria_: in puppet yes - but not "alarms" per se [22:45:47] https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules [22:47:52] nuria_: If the consumer offset does not change over the window, and the lag is either fixed or increasing, the consumer is in an ERROR state, and the partition is marked as STALLED. [22:47:58] this is what happened i think [22:48:15] (1454622892611, 199315032, 17) -> (1454622901672, 199315032, 17) [22:48:21] offset hasn't changed [22:48:24] over 10 minutes [22:48:25] madhuvishy: k, added docs here: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Burrow [22:48:30] that's our window [22:48:38] and lag is fixed [22:49:03] i'm not sure if nothing from that partition is being consumed yet [22:49:06] ottomata: ^ [22:50:54] ahhh [22:50:57] that makes so much sense [22:51:04] our batch time is what, 5 minutes? [22:51:18] hmmm [22:51:19] actually [22:51:21] no it doesn't [22:51:24] i take it back [22:51:25] ottomata: i think so - but we changed it? [22:51:29] because it's consuming from valid-mixed [22:51:36] was gonna suggest that there was just little data for a schema [22:51:40] ottomata: https://github.com/wikimedia/eventlogging/blob/master/README.md is really great, by the way [22:51:41] but, that's wrong [22:51:44] right [22:51:47] there is def data in valid-mixed [22:51:58] thanks!
[22:51:58] but not all partitions are erroring [22:52:07] ori also this is new: https://wikitech.wikimedia.org/wiki/EventBus [22:52:16] would some partitions not have data for some reason? [22:52:17] wip by eevans and I [22:52:59] i don't think we key messages like that, and if we did, i would think there's enough variation in keys to not have particular partition slow [22:53:08] ottomata: either that.. or... offsets are not being commited for those partitions [22:53:18] i dunno [22:53:48] right [22:55:06] or its just a little slow - and we should increase our lagcheck interval [22:55:22] mobrovac: small update for https://github.com/wikimedia/restbase/pull/501 [22:56:08] madhuvishy: no offsets have to be comitted or we would be getting repeated larms right? [22:56:11] *alarms [22:56:23] neat! [22:56:32] nuria_: yes - they are just being a bit slow for some reason i think [22:56:36] milimetric: will merge as soon as travis confirms [22:56:36] madhuvishy: cause the two error alarms are for different partitions [22:56:44] nuria_: yup [22:58:34] nuria_: The lagcheck intervals configuration determines the length of our sliding window. This specifies the number of offsets to store for each partition that a consumer group consumes. The default setting is 10 which, when combined with an offset commit interval (configured on the consumer) of 60 seconds, means we evaluate the status of a consumer group [22:58:34] over 10 minutes. [22:58:50] in our case, I believe the offset commit interval is 1 second [22:58:52] so [22:59:15] it evaluates every 10 seconds? [22:59:21] ottomata: ^ [23:00:13] which i guess is too less. we can have it do it every 1 minute or so may be? 
and increase the number of offsets to store [23:02:17] hmmmmm oh, interesting madhuvishy possibly [23:02:25] i'm not sure how burrow would know about how often pykafka is committing though [23:02:36] oh i guess it is just looking at the last 10 commits [23:02:37] ahhhh [23:02:38] yes [23:02:39] then yes [23:02:41] we should just increase [23:02:43] makes sense [23:02:49] right, last 10 offsets committed [23:02:58] ottomata: yup [23:03:00] okay [23:03:40] so we can bump it to 60 offsets and see if some of these false alarms go away. 10s is too short a window [23:03:58] i never made the connection that we were committing every second [23:04:03] to have 10 minutes should be 600 offsets [23:04:34] which is the example given on the docs [23:04:36] nuria_: yeah - that just seems really high though - i'm suggesting we go up to 1 minute [23:05:38] madhuvishy: what makes it seem high? [23:06:51] ottomata: that looks great [23:07:48] nuria_: i am not really sure - but - if you look at https://github.com/linkedin/Burrow/wiki/Configuration#lagcheck [23:08:00] there is also the min-distance param [23:08:36] madhuvishy: their default is 10 intervals of offset commits of 60 secs = 10 mins [23:08:51] and i wonder what would happen if we increased the min-distance [23:08:53] yes i know [23:09:26] i don't know if an interval of 600 is a good idea - it might be - it just seems like a really high number [23:10:02] madhuvishy: ok, let's do trial and error [23:10:18] yup which is why i thought we could step it up to a minute [23:10:26] madhuvishy: so now we have 10 secs, which is clearly too short as we have false alarms [23:10:26] and if we still get false alarms [23:10:33] we can increase [23:11:11] madhuvishy: let's do the old-fashioned tried and tested method => bump up 1 order of magnitude: 100 secs [23:11:21] madhuvishy: and see whether we get too few alarms or too many [23:11:40] sure why not [23:11:48] madhuvishy: so making intervals=100 should do it
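The change being agreed on, as it might look in the `[lagcheck]` section of Burrow's config (the `intervals` key name is per the Configuration wiki page linked above; this is an illustrative fragment, not copied from the actual puppet template):

```ini
[lagcheck]
; Number of committed offsets kept per partition. With ~1s offset commits
; from the EventLogging consumer, 100 gives a ~100s evaluation window
; instead of the default 10 offsets (~10s), which was causing false alarms.
intervals=100
```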
[23:11:50] hello! who could I talk to about the new pageview API demo? https://analytics.wmflabs.org/demo/pageview-api/ [23:12:01] MusikAnimal: anyone here really [23:12:04] it is very nice, but does not support deeplinking :( [23:12:10] i'm out yallllLlLll laters! [23:12:16] MusikAnimal: no, cause it is a showcase [23:12:21] MusikAnimal: not a tool [23:12:23] ottomata: byeee [23:12:36] MusikAnimal: it is meant as a way for people to cut and paste code and get started with the api [23:12:44] MusikAnimal: fancier tools are coming [23:12:47] yes, the code. Where is it? ha [23:13:37] James_F informed me there is an official tool coming [23:13:41] so I guess I should just wait? [23:13:49] nuria_: it is here - https://github.com/wikimedia/operations-puppet/blob/production/modules/burrow/templates/burrow.cfg.erb. I think we should make the interval configurable though. Do you want to do it? if not I can [23:14:34] madhuvishy: no, it's my ops week, i will do it now [23:14:42] For context, https://meta.wikimedia.org/wiki/Community_Tech/Pageview_stats_tool is being built/supported by Community Tech. [23:14:54] nuria_: okay let me know if you need anything :) [23:15:07] MusikAnimal: right, a tool is being built by community folks [23:15:09] nuria_: as for the cron - I'm sure you'll take care of it - no worries! [23:15:50] madhuvishy: haha, but you want to know how you can bypass the fact that it should be root running that cron (as the depot is checked out with sudo) [23:16:00] madhuvishy: not exactly obvious [23:16:09] aha [23:16:23] how does that work?
[23:16:46] MusikAnimal: https://gist.github.com/marcelrf/49738d14116fd547fe6d#file-article-comparison-html i think this is it [23:16:57] (code for demo) [23:17:11] madhuvishy: you add to the sudoers the ability for some users (that already have sudo) [23:17:23] madhuvishy: to execute commands w/o being prompted for a password [23:20:07] nuria_: aah [23:25:41] like: [23:25:56] nuria ALL=NOPASSWD: somecommand [23:26:18] in /etc/sudoers.d/ - will add this to the ticket [23:26:30] cc madhuvishy [23:26:37] nuria_: mm hmm okay [23:47:24] madhuvishy: did you have time this morning to look at last access numbers with joseph? [23:49:50] nuria_: no.. We never got to it [23:50:04] madhuvishy: k, i am going to take a look out of curiosity
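The demo code in the gist above pulls its numbers from the Pageview API being showcased. A minimal sketch of the kind of request it makes, against the public per-article REST endpoint (the URL layout follows the publicly documented route; treat the exact path segments and parameter defaults as assumptions rather than a reading of the demo's own code, and note that article titles must be URL-encoded):

```python
from urllib.parse import quote

# Hypothetical helper building a per-article pageviews URL for the
# wikimedia.org REST API; segment order is an assumption based on the
# public documentation, not taken from the demo.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageview_url(project, article, start, end,
                 access="all-access", agent="user", granularity="daily"):
    # quote(..., safe="") also escapes "/" so slashes in titles survive.
    return "/".join([BASE, project, access, agent,
                     quote(article, safe=""), granularity, start, end])

url = pageview_url("en.wikipedia.org", "Main Page", "20160101", "20160131")
print(url)
```

Fetching the URL (e.g. with `urllib.request.urlopen`) returns JSON with one entry per day in the requested range.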