[08:59:43] hello!
[08:59:50] I was checking kafka1012
[08:59:52] /dev/sdf1 1.8T 1.5T 322G 83% /var/spool/kafka/f
[08:59:58] --^ is the new disk
[09:00:09] the new disk is already 83% full?
[09:00:23] /o\
[09:00:27] root@kafka1012:/var/spool/kafka/f/data# du -h
[09:00:27] 951M ./webrequest_mobile-10
[09:00:27] 831G ./webrequest_text-1
[09:00:28] 124K ./__consumer_offsets-22
[09:00:28] 5.6G ./eventlogging-valid-mixed-1
[09:00:30] 673G ./webrequest_upload-7
[09:00:32] 4.0K ./eventlogging_MobileWikiAppBannerClickThrough-0
[09:00:35] 171M ./eventlogging_WikipediaPortal-0
[09:00:37] 562M ./eventlogging_MobileWikiAppEdit-0
[09:00:40] 652M ./eventlogging_TestSearchSatisfaction2-0
[09:00:42] 212K ./__consumer_offsets-24
[09:00:45] 1018M ./webrequest_misc-0
[09:00:47] 159M ./eventlogging_ImageMetricsLoadingTime-0
[09:03:13] hm
[09:04:22] elukey: We need to talk to ottomata about EL message keys --> kafka partitions messages based on the message key, and IIRC on EL the key is schema based
[09:06:18] joal: yep I've heard you guys talking about it
[09:06:45] elukey: I'm sure this is the thing, but it might
[09:07:42] elukey: Can you tell me more about eventlogging-valid-mixed-??? partitions on other disks (or other machines?)
[09:10:22] joal: I am missing something, sorry for the extra question. I am seeing webrequest_text/upload as major contributors to the disk saturation, so I thought it was more an imbalance due to varnishkafka
[09:10:35] or possibly due to the kafka broker catching up with too much data
[09:11:21] elukey: Yes you're absolutely right !
[09:11:24] My bad
[09:12:04] ahhhhh okkkk!
[09:12:16] elukey: last week webrequest_mobile was merged into webrequest_text
[09:13:02] the load was therefore shared among 12 partitions (6 for mobile, 6 for text)
[09:13:12] Now, the load is shared among 6 partitions
[09:13:17] Creating issues.
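To make the `du -h` paste above easier to compare, here is a minimal Python sketch that ranks the partition directories by size. The sizes are copied from the paste; the helper names (`parse_size`, `rank_partitions`) are made up for illustration, and the unit table assumes the 1024-based units that `du -h` prints.

```python
# Sizes copied from the `du -h` paste in the log (kafka1012, /var/spool/kafka/f/data).
DU_OUTPUT = """\
951M ./webrequest_mobile-10
831G ./webrequest_text-1
124K ./__consumer_offsets-22
5.6G ./eventlogging-valid-mixed-1
673G ./webrequest_upload-7
4.0K ./eventlogging_MobileWikiAppBannerClickThrough-0
171M ./eventlogging_WikipediaPortal-0
562M ./eventlogging_MobileWikiAppEdit-0
652M ./eventlogging_TestSearchSatisfaction2-0
212K ./__consumer_offsets-24
1018M ./webrequest_misc-0
159M ./eventlogging_ImageMetricsLoadingTime-0
"""

# `du -h` uses powers of 1024 for its K/M/G/T suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `du -h` size such as '5.6G' into a number of bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def rank_partitions(du_output):
    """Return (bytes, directory) pairs, largest first."""
    rows = []
    for line in du_output.splitlines():
        size, path = line.split(maxsplit=1)
        name = path[2:] if path.startswith("./") else path
        rows.append((parse_size(size), name))
    return sorted(rows, reverse=True)


ranked = rank_partitions(DU_OUTPUT)
total = sum(size for size, _ in ranked)
# Share of the disk data held by the two largest partition directories.
top_two_share = (ranked[0][0] + ranked[1][0]) / total

if __name__ == "__main__":
    for size, name in ranked:
        print(f"{name:55s} {size / 1024**3:8.1f} GiB")
    print(f"top two share: {top_two_share:.1%}")
```

On this paste the two webrequest partitions (text-1 and upload-7) hold well over 99% of the data on the disk, which is consistent with the conclusion reached in the conversation that the imbalance comes from webrequest rather than from EventLogging.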
[09:14:09] elukey: I had discussed with ottomata already about the number of partitions being too small when merging everything in text, but we said we'd wait and see --> I think we are gonna take actions :)
[09:16:28] thanks for the clarification! but theoretically webrequest_mobile should go away right
[09:16:31] ?
[09:16:36] without it we are basically inline
[09:16:50] good point in checking the other brokers though
[09:17:04] webrequest_mobile being "away" means all its traffic is now handled by webrequest_text
[09:17:04] I am restarting hhvm atm, going to check them in a bit
[09:17:37] joal: yes, I meant that 951M ./webrequest_mobile-10 will not be needed anymore
[09:18:00] correct elukey, it's in kafka but we don't even import it in camus anymore
[09:18:06] \o/
[09:18:18] I am getting something right once in a while
[09:18:28] all due to your patience joal :D
[09:20:07] narf, I'm not patient, I'm thinking aloud :)
[09:20:24] And you get it right :)
[09:22:31] elukey@neodymium:~$ sudo salt kafka* cmd.run 'df -h | egrep "[6789].%"'
[09:22:34] kafka1013.eqiad.wmnet:
[09:22:36] kafka1012.eqiad.wmnet:
    /dev/sdb3 1.8T 1.5T 271G 86% /var/spool/kafka/b
    /dev/sdf1 1.8T 1.5T 320G 83% /var/spool/kafka/f
[09:22:39] kafka1014.eqiad.wmnet:
[09:22:42] kafka1018.eqiad.wmnet:
    /dev/sdg1 1.8T 1.1T 727G 61% /var/spool/kafka/g
    /dev/sdj1 1.8T 1.1T 729G 61% /var/spool/kafka/j
    /dev/sdb1 1.8T 1.2T 694G 63% /var/spool/kafka/b
[09:22:46] kafka1020.eqiad.wmnet:
[09:22:48] kafka1022.eqiad.wmnet:
    /dev/sdb3 1.8T 1.1T 684G 63% /var/spool/kafka/b
    /dev/sdi1 1.8T 1.1T 713G 62% /var/spool/kafka/i
    /dev/sdk1 1.8T 1.2T 703G 62% /var/spool/kafka/k
[09:23:04] so only kafka1012.eqiad.wmnet is really a bit overloaded
[09:23:08] bad mobile is bad
[09:25:31] ?
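The salt/`df` output above can also be filtered with an explicit numeric threshold. The following is a rough Python sketch, not what was run in the channel: the rows are copied from the paste, and `disks_over` is a hypothetical helper. One reason to prefer a numeric comparison is that the `egrep "[6789].%"` pattern only matches percentages from 60 to 99, so a disk at 100% would not show up.

```python
# df rows copied from the salt paste in the log; hosts with no matching rows are omitted.
DF_ROWS = {
    "kafka1012": [
        "/dev/sdb3 1.8T 1.5T 271G 86% /var/spool/kafka/b",
        "/dev/sdf1 1.8T 1.5T 320G 83% /var/spool/kafka/f",
    ],
    "kafka1018": [
        "/dev/sdg1 1.8T 1.1T 727G 61% /var/spool/kafka/g",
        "/dev/sdj1 1.8T 1.1T 729G 61% /var/spool/kafka/j",
        "/dev/sdb1 1.8T 1.2T 694G 63% /var/spool/kafka/b",
    ],
    "kafka1022": [
        "/dev/sdb3 1.8T 1.1T 684G 63% /var/spool/kafka/b",
        "/dev/sdi1 1.8T 1.1T 713G 62% /var/spool/kafka/i",
        "/dev/sdk1 1.8T 1.2T 703G 62% /var/spool/kafka/k",
    ],
}


def disks_over(rows_by_host, threshold):
    """Return (host, mount point, used %) for every disk above the threshold."""
    flagged = []
    for host, rows in rows_by_host.items():
        for row in rows:
            # df -h columns: device, size, used, available, use%, mount point
            _device, _size, _used, _avail, pct, mount = row.split()
            used_pct = int(pct.rstrip("%"))
            if used_pct > threshold:
                flagged.append((host, mount, used_pct))
    return flagged


if __name__ == "__main__":
    for host, mount, used_pct in disks_over(DF_ROWS, 80):
        print(f"{host} {mount} {used_pct}%")
```

With a threshold of 80 only the two kafka1012 disks are reported, matching the summary given in the channel.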
[09:25:59] elukey: not easily readable
[09:26:52] elukey: when having multiple lines to show, best practice is to use a paste (in phab, gist or whatever you prefer :)
[09:26:58] ahhh sorry, my bad
[09:27:01] np
[09:27:28] tl;dr - only kafka1012 has disks with > 80% of space used
[09:28:10] hm, kafka 1013?
[09:36:22] http://hastebin.com/jupexelobu.rb - better
[09:38:24] 1013 seems fine
[09:38:37] Ahhh, yes !
[09:38:40] Thanks :)
[09:39:10] hm, so 1012 takes a bigger
[09:39:16] hit ... Weird
[09:43:26] trying to check if 1012 is the only one with mobile
[09:45:46] elukey: mobile is not even 1% of text, don't bother
[09:50:51] joal sorry I confused mobile with upload :(
[09:50:59] np :)
[09:51:07] * joal is away for a bit
[10:01:03] Analytics-Tech-community-metrics, Possible-Tech-Projects, Epic: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#2007384 (Kurisutina24) Hello! I would like to work on this project for Outreachy Round 12 and I have also started with the...
[10:09:02] * elukey commutes to the office
[10:54:19] I am in the office but.. internet is not working well (I am on mobile connection for the moment). I hope to be fully ready soon :(
[11:36:51] Analytics-Tech-community-metrics: Microtask: Create a very simple REST API for SortingHat - https://phabricator.wikimedia.org/T114838#2007546 (01tonythomas) >>! In T114838#1971390, @Saylikarnik wrote: > Hello,I am Sayli Karnik ,an Outreachy aspirant for the upcoming Round 12. I am proficient in HTML, CSS, Ja...
[11:39:50] all right back :)
[11:40:32] Analytics-Tech-community-metrics, Possible-Tech-Projects, Epic: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#2007554 (01tonythomas) >>! In T60585#2007384, @Kurisutina24 wrote: > Hello! I would like to work on this project for Outre...
[12:20:55] ottomata: I am sure that https://gerrit.wikimedia.org/r/#/c/268682/13 is wrong, but I tried to refactor it a bit. Please be patient :D [12:23:01] I included also nuria's change [12:45:04] ottomata: we also have a problem with the new disk in kafka1012, namely [12:45:08] /dev/sdf1 1.8T 1.6T 296G 84% /var/spool/kafka/f [12:45:20] let me know when you'll be up and running :) [13:15:51] * elukey grabs lunch [13:31:14] Analytics-Tech-community-metrics: Microtask: Create a very simple REST API for SortingHat - https://phabricator.wikimedia.org/T114838#2007683 (Aklapper) Hi @Saylikarnik. Thanks for your interest! Apart from what @01tonythomas already wrote: As you commented on this task, do you have a [[ https://www.mediawik... [13:33:26] Analytics-Tech-community-metrics, Possible-Tech-Projects, Epic: Allow contributors to update their own details in tech metrics directly - https://phabricator.wikimedia.org/T60585#2007687 (Aklapper) @Kurisutina24: Hi and welcome! Please also check https://www.mediawiki.org/wiki/How_to_become_a_MediaWik... [13:38:19] * elukey back [13:38:34] joal: /away [13:38:40] ? [13:41:27] elukey: wassup ? [13:41:38] sorry Joseph! I wanted to tell you something and put my status not in away [13:41:51] :D [13:42:00] ah, no prob, just wondered :) [13:42:48] anyhow, wanted to tell you that hhvm has been upgraded, but since kafka1012 is still not "green" I think it would be best to wait for andrew before restarting the brokers [13:43:02] so not sure if we'll do the work today [13:43:23] agreed: this disk thing should be sorted before we move forward I guess [13:50:17] Analytics-Tech-community-metrics, DevRel-February-2016: top-contributors.html is not sorted by rank anymore - https://phabricator.wikimedia.org/T125797#2007722 (Aklapper) Open>declined a:Aklapper Cannot reproduce. Will reopen once I manage again. 
[13:50:20] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#2007725 (Aklapper) [14:01:28] (PS1) Addshore: Fix WikimediaCurl @author tag [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/269120 [14:01:52] (CR) Addshore: [C: 2 V: 2] Fix WikimediaCurl @author tag [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/269120 (owner: Addshore) [14:09:49] Analytics, DBA, WMDE-Analytics-Engineering: labtestwiki appears in the dblist but can not be found on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007754 (Addshore) NEW [14:17:31] Analytics, DBA, WMDE-Analytics-Engineering: labtestwiki appears in the dblist but can not be found on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007762 (Krenair) The wiki exists, it's just not hosted by the normal MySQL servers. It's like labswiki which runs on silver onl... [14:17:31] Hey! Is there a tool where I can see the load time stats for enwiki articles? It seems really slow over the past few days. (Not sure if this is the right channel to ask about this :) [14:19:04] Analytics, DBA, WMDE-Analytics-Engineering: labtestwiki appears in the dblist but can not be found on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007765 (Krenair) See also T89548 which obviously should be dealt with before this [14:21:02] Analytics, DBA, WMDE-Analytics-Engineering: labtestwiki appears in the dblist but can not be found on analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007769 (Krenair) > I am making an incorrect assumption that dbs on the list should always be replicated to this servers I thin... 
[14:21:26] (PS1) Addshore: Make minutely wdqs run for each host each min [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/269124 (https://phabricator.wikimedia.org/T126004)
[14:21:34] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007773 (Krenair)
[14:22:00] (CR) Addshore: [C: 2 V: 2] Make minutely wdqs run for each host each min [analytics/limn-wikidata-data] - https://gerrit.wikimedia.org/r/269124 (https://phabricator.wikimedia.org/T126004) (owner: Addshore)
[14:24:39] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2007779 (Addshore) Thanks for all the info @Krenair and poking this ticket into the correct shape ;)
[15:19:35] elukey: good morning!
[15:19:43] goooood morning!
[15:19:49] y u change the $email_template = 'burrow/email.tmpl'? :)
[15:20:57] also, nuria has added lag_window in this patch: https://gerrit.wikimedia.org/r/#/c/268594/6/modules/burrow/manifests/init.pp
[15:22:37] not sure, I felt that it belonged in hiera with nuria's change, but I can revert :)
[15:23:13] so, about module defaults
[15:23:17] the module should always have sane defaults
[15:23:28] usually, those are what the package or service you are configuring has
[15:23:51] most of the time, you want to think of the module as 100% usable outside of the operations/puppet repo
[15:23:58] pretend you don't have hiera or role classes at all
[15:24:06] it should be totally decoupled
[15:24:22] ok, I was looking at the problem more like "I want to force people to be aware of those parameters"
[15:24:22] (most of the time) someone else should be able to take the module and use it in their own puppet repo if they wanted
[15:25:10] naw you want to make it easy to use. those parameters have defaults that will work in most cases.
for special cases people can change them if they want, and in those cases they look up how to change them
[15:25:28] there may be occasions when you will want to force people to set things
[15:25:32] ah snap I didn't see https://gerrit.wikimedia.org/r/#/c/268594/6, but only the other one with the value :(
[15:26:00] but, not for changing simple defaults like lagcheck_intervals
[15:26:03] or the email template
[15:26:30] makes sense.
[15:26:45] I'll move everything from hiera to the module then
[15:26:56] but possibly after 268594
[15:27:45] heh, either way one will conflict with the other, and you'll have to resolve in a local rebase
[15:27:47] but that's ok
[15:28:13] we could also pack everything in mine
[15:28:27] naw, easiest to do it this way
[15:28:32] conflicts are easy enough to resolve
[15:28:44] and it's better (even though i am bad at this) to have small commits that do one thing
[15:28:53] +!
[15:28:56] +1
[15:29:24] all right I'll wait for the code to be merged, then I'll resolve the conflict
[15:29:27] :)
[15:32:17] hehe, there's no conflict yet
[15:32:19] if we merge yours first
[15:32:24] nuria will have to resolve it :p
[16:50:17] holaaa
[16:50:30] Heya
[17:00:20] a-team: standddupppppp
[17:01:14] ops meetinggggg
[17:01:15] sorry!
[17:01:26] nuria_: me too!
[17:01:36] maybe we should change monday standup time? :)
[17:06:38] (CR) Milimetric: [C: 2 V: 2] Add note in README about Hiera hostnames config [analytics/dashiki] - https://gerrit.wikimedia.org/r/268829 (owner: Madhuvishy)
[17:07:17] (CR) Milimetric: [C: 2 V: 2] Add friendly prints to the fab tasks [analytics/dashiki] - https://gerrit.wikimedia.org/r/268830 (owner: Madhuvishy)
[17:09:03] Analytics-Kanban, Patch-For-Review: Buurow Increase length of window to evaluate lag [1 pts] - https://phabricator.wikimedia.org/T125916#2008314 (Nuria) a:Nuria
[17:10:44] (CR) Milimetric: [C: -1] Updated result of validation after creating cohort.
(1 comment) [analytics/wikimetrics] - https://gerrit.wikimedia.org/r/263911 (owner: Wassan.anmol) [17:14:47] Krinkle: yt? [17:40:26] Analytics-Kanban, Patch-For-Review: Fabric-alize dashiki dashboard deployments {crow} [13 pts] - https://phabricator.wikimedia.org/T110351#2008424 (Milimetric) [17:40:34] a-team: sorry for not having sent the e-scrum but I was fighting with a redis/memcached issue in prod :( [17:40:44] nuria_: afk. Back in 1-2h [17:41:05] Krinkle: ok, we need 15 mins of your time today if that is ok, [17:41:15] elukey, np, you can send it later if you want no? [17:41:43] sure! [17:42:35] Analytics-Kanban: Productionize last access jobs for monthly calculations {bear} [8 pts] - https://phabricator.wikimedia.org/T124678#2008426 (JAllemandou) [17:42:52] nuria_: ok [17:44:23] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case {oryx} [5 pts] - https://phabricator.wikimedia.org/T125228#2008439 (Milimetric) [17:45:10] Analytics: Cassandra Backfill July [5 pts] {melc} - https://phabricator.wikimedia.org/T119863#2008448 (Nuria) [17:45:12] Analytics-Kanban: Projections of cost and scaling for pageview API. 
{hawk} [8 pts] - https://phabricator.wikimedia.org/T116097#2008447 (Nuria) Open>Resolved [17:47:43] Analytics: Consider SSTable bulk loading for AQS imports - https://phabricator.wikimedia.org/T126243#2008467 (Eevans) NEW [17:51:19] Analytics-Kanban: Make Dashiki get pageview data from pageview API {melc} [8 pts] - https://phabricator.wikimedia.org/T124063#2008513 (Milimetric) [17:51:51] Analytics-Kanban, Patch-For-Review: Fabric-alize dashiki dashboard deployments {crow} [13 pts] - https://phabricator.wikimedia.org/T110351#2008514 (Nuria) Open>Resolved [17:57:40] Analytics: Get piwik stats for dashiki - https://phabricator.wikimedia.org/T126247#2008529 (Nuria) p:Triage>Normal [18:00:04] Analytics-Kanban: Have dashiki read and write GET params to pass stateful versions of dashboard pages {crow} - https://phabricator.wikimedia.org/T119996#2008544 (Milimetric) a:Nuria>None [18:00:06] Analytics-Kanban: Have dashiki read and write GET params to pass stateful versions of dashboard pages {crow} - https://phabricator.wikimedia.org/T119996#2008546 (Nuria) [18:00:18] Analytics: Have dashiki read and write GET params to pass stateful versions of dashboard pages {crow} - https://phabricator.wikimedia.org/T119996#1842220 (Nuria) [18:01:01] Analytics, ArchCom-RfC, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2008548 (Milimetric) [18:08:48] nuria_: if you have a minute later on I'd like to talk about https://gerrit.wikimedia.org/r/#/c/268594/6 [18:09:05] elukey: we are on tasking , you can join in if you want [18:09:26] yep I was about to, still working on some ops things :( [18:09:27] elukey: ah, sorry [18:09:40] Analytics-EventLogging, Analytics-Kanban: Add autoincrement id to EventLogging MySQL tables. 
{oryx} [8 pts] - https://phabricator.wikimedia.org/T125135#2008569 (Milimetric) [18:09:42] elukey: let's talk about it yes, i thought it was your change not mine [18:10:08] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#2008571 (Milimetric) [18:10:23] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#1981933 (Milimetric) p:High>Unbreak! a:elukey [18:10:42] elukey: we also assigned you one item in tasking, we can talk about it tomorrow [18:10:59] Analytics-Kanban: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#1981933 (Milimetric) p:Unbreak!>High [18:12:04] Analytics: Consider SSTable bulk loading for AQS imports - https://phabricator.wikimedia.org/T126243#2008591 (Eevans) [18:24:37] Analytics: Get piwik stats for dashiki - https://phabricator.wikimedia.org/T126247#2008673 (Johsthao) [18:24:45] Analytics: Consider SSTable bulk loading for AQS imports - https://phabricator.wikimedia.org/T126243#2008677 (Johsthao) [18:25:10] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2008689 (Johsthao) [18:32:18] Analytics: Get piwik stats for dashiki - https://phabricator.wikimedia.org/T126247#2008781 (matmarex) duplicate>Open [18:32:28] Analytics: Consider SSTable bulk loading for AQS imports - https://phabricator.wikimedia.org/T126243#2008786 (matmarex) duplicate>Open [18:33:04] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2008803 (matmarex) duplicate>Open [18:34:40] Analytics-EventLogging, Analytics-Kanban: Send raw server side events to Kafka using a PHP Kafka Client {oryx} [0 pts] - https://phabricator.wikimedia.org/T106257#2008848 (Milimetric) [18:35:20] 
Analytics-EventLogging, Analytics-Kanban: Send raw server side events to Kafka using a PHP Kafka Client {oryx} [0 pts] - https://phabricator.wikimedia.org/T106257#2008851 (Nuria) Substaks: This is likely between 21 and 34. Substasks: - make sure we can publish json text with mediawiki mononlog (right... [18:47:09] Analytics: Remove cron on wikimetrics instance that updates vital signs [1 pts] - https://phabricator.wikimedia.org/T125751#2008962 (Nuria) [18:47:58] Analytics, Analytics-EventLogging: Send raw server side events to Kafka using a PHP Kafka Client {oryx} [0 pts] - https://phabricator.wikimedia.org/T106257#2008963 (Nuria) [18:51:35] Analytics, Analytics-EventLogging: Send raw server side events to Kafka using a PHP Kafka Client {oryx} [0 pts] - https://phabricator.wikimedia.org/T106257#2008988 (Nuria) p:Normal>High [18:52:07] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2008995 (jcrespo) > I am making an incorrect assumption that dbs on the list should always be replicated to this servers There is some separation between lab... [18:52:58] Analytics, Analytics-EventLogging: Send raw server side events to Kafka using a PHP Kafka Client {oryx} [0 pts] - https://phabricator.wikimedia.org/T106257#2009006 (Milimetric) [18:53:00] Analytics: Server side eventlogging should publish to kafka and not use udp {stag} - https://phabricator.wikimedia.org/T124813#2009005 (Milimetric) [18:55:07] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2009011 (jcrespo) Oh, I missread labtestwiki vs. labswiki. If labstestwiki is on s3, and it is as small as I suppose, it should be already there. I will inves... 
[18:56:20] oh mann i shoulda come to tasking [18:56:22] sorry guys [18:56:28] was helping jeff green with more kafka stuff [18:56:59] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2009014 (jcrespo) I should have read again, my previous comment apply, then T126218#2008995. [18:57:25] Analytics: camus-wediawiki job should run in production (or essential?) queue [1 pts] - https://phabricator.wikimedia.org/T125967#2009015 (Nuria) [18:57:28] Analytics: camus-wediawiki job should run in production (or essential?) queue {hawk} [1 pts] - https://phabricator.wikimedia.org/T125967#2009017 (Milimetric) [18:59:28] Analytics: Use a new approach to compute monthly top 1000 articles (brute force probably works) [8 pts] - https://phabricator.wikimedia.org/T120113#2009020 (Nuria) [19:00:07] Analytics: Use a new approach to compute monthly top 1000 articles (brute force probably works) {slug} [8 pts] - https://phabricator.wikimedia.org/T120113#2009021 (Milimetric) [19:01:34] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2009022 (Addshore) >>! In T126218#2008995, @jcrespo wrote: > I strongly suggest to iterate over "all - silver.list", if that makes sense. If there is real int... [19:07:52] madhuvishy: btw, the wikimetrics deploy went perfectly smoothly [19:08:02] i did staging then prod [19:08:04] milimetric: oh yay :D [19:09:13] milimetric: i should have put a line in the fabric readme about restarting puppet after changing hiera config or waiting ~20 minutes for the changes to effect - should i do it or can you add it? 
[19:10:01] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2009045 (jcrespo) I have no idea where that is, but if its name is labstestweb*2*XXX, there is a high chance it is on a different datacenter (dallas). [19:10:54] milimetric: should we do that restbase change? [19:21:42] Analytics, DBA, WMDE-Analytics-Engineering: Replicate wikitech wikis to analytics-store.eqiad.wmnet - https://phabricator.wikimedia.org/T126218#2009089 (Krenair) The dblist containing both wikis with DBs hosted externally is wikitech.dblist. [19:22:58] ottomata: yeah, lemme eat something 'cause we just had meetings all day [19:23:06] *all afternoon :) [19:23:20] but I say all day 'cause i'm grumpy, so I need to eat :) [19:24:13] elukey: I'll do this now - if that's fine with you. https://phabricator.wikimedia.org/T125225 [19:25:16] nuria: I'm not aware of any progress or change in browser reports. What exactly is this about? I'm a bit behind on mailing lists. [19:27:34] madhuvishy: sure :) [19:27:45] elukey: cool! [19:28:09] Analytics-Kanban, Patch-For-Review: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#2009149 (madhuvishy) a:elukey>madhuvishy [19:28:10] madhuvishy: I've also found a solution for the burrow email template, but the CR is still not ready.. Andrew made some comments and I am waiting for Nuria's CR [19:28:21] elukey: okay :) [19:28:38] but I want to add stuff like the lagcheck interval, that are config dependent [19:28:52] so anybody will be able to make calculation on the sliding window directly [19:28:59] not making assumption [19:29:47] right [19:30:06] ottomata: https://gerrit.wikimedia.org/r/#/c/269185 can you CR this? [19:33:23] looks good madhuvishy, shall I merge? [19:33:28] ottomata: yup! [19:35:54] elukey: i have +1 already [19:36:19] all right shall I merge? 
[19:37:49] elukey: ottomata will merge i think [19:37:53] there is a comment in https://gerrit.wikimedia.org/r/#/c/268594/7/hieradata/role/common/analytics/burrow.yaml [19:38:26] all right I'll amend mine tomorrow :) [19:38:33] elukey: ah , sorry,m not my patch, I thought we were talking about yours [19:39:04] ottomata: i should restart puppet for the change to take effect - should i restart eventlogging? [19:39:09] nope! Mine is not ready yet, I wanted to merge yours before to re-use the lagcheck interval in the email template [19:39:16] puppet has run [19:39:17] elukey: sorry, i need to submit 1 more [19:39:19] madhuvishy: yes restart el [19:39:24] ottomata: oh cool okay doing [19:39:28] nuria: sure! I'll restart tomorrow :) [19:39:41] elukey: can merge! [19:39:42] :) [19:39:49] he's got da powerrrrr [19:40:02] DA POWA [19:40:24] but Nuria needs to submit a change, so you'll do it later on :P :P [19:40:52] oh ok hehh [19:40:57] elukey: what about yours? looking..>. :) [19:41:03] oh you want nuria's to go first? [19:41:18] milimetric: i'm ready for aqs stuff whenever you are [19:41:41] ottomata: yes! so I'll re-use the parameter and I'll remove heira stuff [19:42:07] ottomata: looks good - there's only one consumer running [19:42:28] k [19:42:29] perfect [19:43:42] will keep an eye on grafana [19:43:44] logging off, talk with you tomorrow!! [19:43:49] laters! [19:43:57] good night elukey :) [19:44:32] Hm.. Kafka metrics in graphite broke. I guess the metric changed? [19:44:57] Previously: kafka.kafka*.kafka.server.BrokerTopicMetrics [19:45:01] Currently: kafka.cluster.analytics-eqiad.kafka.*.kafka.server.BrokerTopicMetrics [19:45:27] it chnaged Krinkle [19:45:35] to make it work with mulitple clusters [19:45:49] sorry, shoulda thought to notify you [19:46:04] check https://grafana.wikimedia.org/dashboard/db/kafka for some usage [19:46:09] did some templating stuff [19:46:30] Hm.. 
k [19:46:38] It doens't go back more than a week or so [19:46:55] the metrics weren't copied over, they are just new metrics now [19:47:05] but, if you are looking at last week, it may be weird if you include kafka1012 [19:47:12] it was down for a while last week [19:47:32] and for some reason grafana won't show data if it has to render for all brokers in the time period when it was down [19:51:11] A-team, I'm off for tonight ! [19:51:17] See y'all tomorrow :) [19:51:26] night joal :) [19:58:11] ottomata: still looks down to me now? [19:58:13] 0 messages [19:58:54] anyway, the new pattern works fine [19:59:02] will have to update a number of dashboards eventually [19:59:06] It's quite a long property path [19:59:53] yeah :/ [20:00:00] Krinkle: , quick grafana q for you [20:00:16] i want have that kafka messages per sec metric [20:00:22] sorry messagesIn [20:00:23] i can get [20:00:27] OneMinuteRate [20:00:36] or I can get count ( which is always increasing) [20:00:40] i want to look at [20:00:47] sum messages per minute [20:01:02] I'm going to ask the services folks about deploying because that test script wasn't working, then we can deploy [20:01:06] i think i'd want to do sum(count, 1m) or something [20:01:07] ottomata: Check https://wikitech.wikimedia.org/wiki/Graphite#Counters first [20:01:11] but im' not sure [20:01:13] oo k [20:01:14] Use .rate always [20:01:25] Which is average rate per second [20:01:28] ah but these are not from statsd [20:01:35] Hm. 
k [20:01:43] well, i guess they are [20:01:44] hm [20:01:44] hang on [20:01:47] Still, Im fairly sure OneMinuteRate is per second [20:01:59] It's the avg rate / sec of one minute window [20:02:03] no they aren't [20:02:07] yeh it is [20:02:17] :) [20:02:21] ah scale [20:02:22] ok trying [20:02:33] Check https://grafana-admin.wikimedia.org/dashboard/db/eventlogging-schema [20:02:48] ja that looks right [20:02:52] https://grafana-admin.wikimedia.org/dashboard/db/eventlogging-schema?panelId=9&fullscreen&edit [20:02:57] ahh yeha [20:02:58] cool [20:03:18] OneMinuteRate scale(60) if you want per min [20:03:24] and always sumSeries() to add up from diff brokers [20:03:31] great perfect [20:04:02] it also has a MessageInPerSec....FifteenMInuteRate for example [20:04:09] k :) [20:05:01] ok, ottomata, ready to plan the deploy [20:05:10] so we have to sync puppet with code [20:05:27] not sure how to do that, I haven't done the deployer patch for the code yet [20:05:52] Krinkle: can't use sumSeries on this, because I've got 2 wildcards (broker, topic) [20:05:55] trying to groupByNode... [20:05:59] with scale [20:06:00] not sure that works [20:06:13] oh ja it does [20:06:14] cool [20:06:31] hmm maybe [20:07:02] ottomata: That's fine. sumSeriesWithWildcard() [20:07:09] To pick which one you want [20:07:11] to expand [20:07:15] oo [20:07:31] in Grafana, you can click on a function to get a (?) visible, which points to Graphite documentation [20:07:36] for e.g. signature params [20:07:48] http://graphite.readthedocs.org/en/latest/functions.html [20:08:02] (some listed there are not available in our install though as we have a slightly older version) [20:08:37] ottomata: Graphite has many problems. But not having enough functions is not one of them. [20:08:37] ja [20:08:46] uhhh, hm, is the node 0 indexed? [20:08:51] (milimetric 2 mins...) [20:08:57] Maybe, on Monday? I think so. [20:08:59] no rush [20:09:05] ha no, i mean [20:09:09] in the metric [20:09:11] like [20:09:12] a.b.c. 
[20:09:14] is c 3 or 2?
[20:09:21] I know. I was joking. The point is, It's unpredictable.
[20:09:23] haha
[20:09:23] ok
[20:09:25] really?
[20:09:28] I think this one is
[20:09:28] depends on function?
[20:09:37] You sound so surprised !
[20:09:42] yes! haha
[20:10:02] uhh i'm confused because my aliasByNode is not what I expect.
[20:10:09] so i'm not sure if i'm doing sum with wildcards right
[20:10:11] Right. That one probably isn't.
[20:10:12] hang on, will save and link you
[20:10:16] Just try it :)
[20:10:27] well, sum with wildcards doesn't seem to care
[20:12:01] Krinkle: https://grafana-admin.wikimedia.org/dashboard/db/eventbus?panelId=3&fullscreen
[20:12:59] changing the number in sumSeriesWithWildcards doesn't do what i'd expect
[20:13:06] and i have no idea how aliasByNode thinks topic name is 8
[20:13:07] :)
[20:13:18] anyway, no worries if you don't have time to look at it, that looks about right to me
[20:13:21] milimetric: let's do it!
[20:13:45] ok milimetric so
[20:13:50] i merge change and run puppet on nodes
[20:13:59] no wait
[20:14:01] then, we do a deploy to just aqs1001
[20:14:10] restart restbase there
[20:14:12] then you test
[20:14:16] then if good, we proceed with others?
[20:14:22] i'm not sure the code change has what i was expecting: https://gerrit.wikimedia.org/r/#/c/269199/1,publish
[20:14:26] it's huge and hard to read
[20:14:36] ottomata: It seems these metrics are only on one broker each
[20:14:37] oh, ok
[20:14:39] so it's not very visible
[20:14:40] check with services?
[20:15:14] sum...(4) seems to do it
[20:15:23] oh ya, hm, because the topics only have one partition
[20:15:24] If you remove the alias() you'll see which one in the metric name disappears
[20:15:58] huh ok i see
[20:15:58] cool
[20:16:01] Which is why aliasByNode() changes depending on presence of sumSeriesWithWildcards
[20:16:02] ohhhh
[20:16:04] got it!
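The node-index confusion above can be illustrated with a small Python sketch. It only mimics the naming effect described in the channel (the node at the summed position disappears from the series name, so later positions shift down by one); it does not call Graphite. The metric path is partly hypothetical: only the prefix up to `BrokerTopicMetrics` appears in the log, and the trailing nodes (`MessagesInPerSec`, `some_topic`, `OneMinuteRate`) and the helper names are assumptions for illustration.

```python
# Hypothetical full metric path. Only the prefix up to BrokerTopicMetrics is
# taken from the log; the trailing nodes are assumed for illustration.
METRIC = (
    "kafka.cluster.analytics-eqiad.kafka.kafka1012.kafka.server."
    "BrokerTopicMetrics.MessagesInPerSec.some_topic.OneMinuteRate"
)


def drop_nodes(name, *positions):
    """Mimic the series name left after summing over the given 0-based node positions."""
    parts = name.split(".")
    return ".".join(part for i, part in enumerate(parts) if i not in positions)


def node_at(name, position):
    """Mimic aliasByNode with a single position: pick one 0-based node of the name."""
    return name.split(".")[position]


# Summing over the broker node (position 4) removes it from the name.
summed = drop_nodes(METRIC, 4)

if __name__ == "__main__":
    print(node_at(METRIC, 4))   # broker node in the original name
    print(node_at(METRIC, 9))   # topic node in the original name
    print(node_at(summed, 8))   # same topic node after the broker node is dropped
```

Under these assumptions the topic sits at position 9 in the raw name and at position 8 once the broker node has been summed away, which would explain both the "sum...(4)" and the "topic name is 8" remarks.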
[20:16:07] that makes sense now [20:16:07] yeeeeeah [20:16:18] It's all piped [20:16:47] You can even aliasByNode() and then aliasSub() to transform the chosen name [20:16:55] ottomata: no, it's good, we can deploy [20:16:58] (which is nicer than building a massive regex to catch the right property) [20:17:00] anyway :) [20:17:03] so the directions are the *deploy bullet: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/AQS#Deploying [20:17:16] thanks Krinkle, that's all good then [20:17:19] if you run puppet, I think the other nodes would die if they restarted [20:17:20] yw [20:17:28] so can we run puppet on just the one we're updatign first? [20:18:01] yes [20:18:08] puppet will restart restbase? [20:18:13] on config change? [20:18:16] don't think so [20:18:21] but just in case they restart for other reasons [20:18:26] they'll break with the new config [20:18:31] it's nonsense to the old code [20:19:46] milimetric: no it won't [20:19:49] ok [20:19:53] so we can stop puppet on all [20:19:54] and just do one [20:19:57] ok [20:20:01] https://gerrit.wikimedia.org/r/#/c/268560/ is the puppet change btw [20:20:11] ja [20:20:41] ok puppet stopped [20:20:47] so, i should merge puppet and run on aqs1001? [20:21:09] milimetric: ? [20:21:52] i think so [20:21:57] nuria: to add autoincrement ids - https://gerrit.wikimedia.org/r/#/c/188270/6 has to be reverted - and some migration needs to run on all the tables to add the field? will we do the migration? [20:22:22] yeah, ottomata, let's do it [20:22:33] (i'm just weirded out by how many changes there are :)) [20:23:00] milimetric: you can do deploys and restarts, ja? 
[20:23:17] (running puppet on aqs1001) [20:23:32] ottomata: don't think so [20:23:39] but if i can, it's news to me [20:23:51] yeah, i don't think so, I don't even have ansible [20:23:54] i thought yall worked that out with proper sudo permissions [20:24:08] ok config updated [20:24:12] no, i think there was no way to do it without giving us full sudo and nobody thought that was a good idea [20:24:13] including me [20:24:22] we were just gonna wait for luca and [20:24:28] oh shit, elukey: we're doing the deploy [20:24:32] i forgot to ping you [20:24:39] oo hey, we should move this page [20:24:40] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/AQS [20:24:45] as AQS is not technically analytics cluster [20:24:52] yeah, feel free to move it [20:26:06] (you created it there so I left it :)) [20:27:38] haha I did?! :) [20:27:43] ok so [20:27:50] puppet finish? [20:28:27] yes [20:28:28] madhuvishy: no, we cannot do it as none of us has permits, it is something to coordinate with jynus [20:28:55] madhuvishy: so we need to do code changes/test changes , deploy those [20:28:59] nuria: right - should we write the script to do it? [20:29:13] milimetric: just did [20:29:15] • git checkout master && git checkout -- src && git branch -D sync-repo [20:29:19] madhuvishy: and once that is done talk to jaime about updating the rest of the tables [20:29:19] error: pathspec 'src' did not match any file(s) known to git. [20:29:25] nuria: ya that part should be fine i think. 
alright [20:29:41] madhuvishy: now, testing needs to be done to see how code affects new tables created/old tables existing, etc. [20:29:47] but that can be done on beta labs [20:29:56] ottomata: that's my part of the deploy i think [20:30:02] nuria: yeah got it [20:30:07] your part is just the last main bullet under "deploy" [20:30:14] oh ok [20:30:28] right right [20:30:30] in the ansible-deploy repo [20:30:31] ansible-playbook --check -i production -e target=aqs roles/restbase/deploy.yml [20:30:41] but probably update the repo first in case there have been changes [20:30:57] ja [20:31:02] how do we target just one node? [20:31:12] uh [20:31:15] no idea, asking [20:31:25] -l aqs1001* [20:31:29] :) [20:31:32] woa [20:31:41] as in 'limit' [20:31:42] he can see my questions before I send them on IRC [20:31:52] ok checking [20:32:04] gwicke is a BOT [20:32:15] that worked milimetric [20:32:19] shall I do real deploy with that? [20:32:25] yeah [20:32:31] k [20:32:37] lol [20:32:52] TASK: [restart restbase] [20:32:55] TASK: [check port 7231] [20:32:58] ok: [aqs1001.eqiad.wmnet] [20:33:04] PLAY RECAP ******************************************************************** [20:33:04] aqs1001.eqiad.wmnet : ok=3 changed=2 unreachable=0 failed=0 [20:33:05] looks good [20:33:08] milimetric: check from your end? [20:33:11] checking [20:33:23] ottomata: it says all good [20:33:31] I think we're good! [20:33:35] let's do the others [20:34:05] k [20:34:14] running puppet on others [20:35:26] elukey: patch sent https://gerrit.wikimedia.org/r/#/c/268594/8 [20:35:34] ok puppet run [20:35:56] milimetric: check without limit looks good [20:35:58] deploying all [20:36:03] cool [20:36:54] milimetric: 1003 failed [20:36:58] failed: [aqs1003.eqiad.wmnet] => {"failed": true} [20:36:58] msg: Failed to download remote objects and refs [20:36:58] FATAL: all hosts have already failed -- aborting [20:37:05] hm [20:37:05] 1001 1002 look good [20:37:08] should i try just 1003 again?
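The canary-first sequence used above (dry run on one node, real deploy on that node, dry run everywhere, real deploy everywhere, then a retry limited to the host that failed) can be written down as a list of commands. A minimal sketch, assuming the playbook path, inventory name and target quoted in the log; the helper names playbook_command and rollout_plan are made up for illustration, and nothing here executes ansible.

```python
import shlex

# Values quoted in the log above.
PLAYBOOK = "roles/restbase/deploy.yml"
INVENTORY = "production"
TARGET = "aqs"


def playbook_command(limit=None, check=False):
    """Build the ansible-playbook argument list for one step.

    `limit` is a host pattern passed to -l; `check` adds --check (dry run).
    """
    cmd = ["ansible-playbook"]
    if check:
        cmd.append("--check")
    cmd += ["-i", INVENTORY, "-e", f"target={TARGET}", PLAYBOOK]
    if limit:
        cmd += ["-l", limit]
    return cmd


def rollout_plan(canary):
    """Dry run and deploy on the canary first, then on every host."""
    return [
        playbook_command(limit=canary, check=True),
        playbook_command(limit=canary),
        playbook_command(check=True),
        playbook_command(),
    ]


plan = rollout_plan("aqs1001*")
retry = playbook_command(limit="aqs1003*")  # re-run only the host that failed

for step in plan + [retry]:
    print(shlex.join(step))  # quotes the host pattern so the shell does not expand it
```

The plan leaves a gap between the second and third command on purpose: that is where a real request against the canary belongs before the remaining hosts are touched.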
[20:37:23] it says it's healthy so it must be running the old stuff [20:37:27] i think so [20:37:27] yeah [20:37:27] sure, try again [20:37:30] it failed at checkout [20:37:46] ok, worked that time [20:38:35] weird [20:39:47] ottomata: but did it finish the deploy? [20:41:23] yewsa [20:41:24] yes [20:41:28] aqs1003.eqiad.wmnet : ok=3 changed=2 unreachable=0 failed=0 [20:42:11] ok, all the tests are fine [20:43:12] milimetric: just read the message, don't worry! [20:43:18] ottomata: hm... everything's broken [20:43:27] https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Selfie/daily/2015100100/2015103000 [20:44:01] milimetric: this is signaling: "we need a test environment" [20:44:55] what is this "test" thing that you speak of? [20:45:19] I'm guessing something's wrong in the puppet config [20:45:20] ooook, milimetric what to do? [20:45:34] mobrovac: any idea? All the pageview API URLs are not-found now [20:46:28] the unit tests in restbase / hyperswitch must not have been using something that's now in the production config [20:47:06] milimetric: also post mortem, but 1st things 1st cc gwicke [20:47:34] * mobrovac taking a look [20:47:48] I'm chatting in -services nuria / mobrovac [20:48:00] milimetric: can we roll back? [20:48:12] not easily, because it's a puppet change [20:48:33] but maybe if we check out reflogs [20:48:33] we should be able to "unmerge" via gerrit [20:48:49] milimetric: so we restore availability [20:49:03] yeah, but in this case it would have to unmerge two unrelated repos [20:49:04] you can deploy the previous version [20:49:06] to two unrelated points [20:49:13] gwicke: it's puppet and code together [20:49:25] we can do that if there's no obvious problem that we can just fix [20:49:40] seems to me it has to be something simple that we all missed in the new config [20:50:05] the unit tests worked, but it would be really nice to have a test environment [20:50:25] did you do any tests on the canary?
[20:50:36] like a test request for some pageview data? [20:51:54] re roll-back: can you roll back puppet & code? [20:52:21] i can do [20:52:28] well,i know how to do puppet [20:52:54] for the code, you specify the previous deploy repo hash in the config & do another deploy [20:53:39] milimetric: shall I? [20:54:01] yeah ottomata, I guess nobody else is chiming in with any quick solutions [20:54:25] ok [20:54:27] milimetric: found the problem [20:54:40] ok ottomata maybe wait a sec :) [20:54:42] oh, should I wait? [20:54:42] ok [20:54:54] projects/aqs_default.yaml creates the /metrics path [20:55:19] and it shouldn't [20:55:55] so should it just be: [20:56:00] x-modules: [20:56:00] /: [20:56:00] - path: v1/pageviews.yaml [20:56:02] ? [20:56:17] yup [20:56:23] ok, I'll send the pull [20:56:27] any other changes mobrovac ? [20:56:33] nope [20:56:42] that should work [20:56:59] tests will probably need to be adjusted as well [20:57:36] mobrovac: lemme fix that in a separate commit, let's just get this out to fix the deploy [20:57:51] mobrovac: or should I just fix that locally on the boxes and see if it works? [20:58:01] confirmed, curl -v localhost:7232/analytics.wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Selfie/daily/2015100100/2015103000 gives a 200 [20:58:02] yeah, I'll do that, and then update the pull request with tests [20:58:27] yup, milimetric, fixing it on the hosts right now is a better option [20:58:35] milimetric: you need to be root for that though [20:58:49] oh, i didn't know [20:58:56] ottomata: you wanna hang out and do it together in the batcave? [20:58:57] ottomata can do it [20:59:09] thanks for looking into this, mobrovac [20:59:13] np [20:59:23] * mobrovac is rewarding himself with lunch [20:59:29] eh? [21:00:15] milimetric: not sure what needs changed [21:00:24] ottomata: in /srv/deployment/restbase/deploy/src/projects/aqs_default.yaml, replace line 16 to say only /: [21:00:51] milimetric: can you do that? 
I don't have that repo cloned [21:01:02] ottomata: on the aqs nodes in prod [21:01:06] oh [21:01:20] that's why we need your super powers [21:01:21] :) [21:01:39] so say only what? [21:02:06] i think i need a paste [21:02:09] am getting emoticons in irc [21:02:10] it has to read slash-colon instead of slash-metrics-colon [21:02:30] ottomata: i'm in the batcave if you wanna share I can show you [21:02:31] https://gist.github.com/ottomata/182be7f537fefac65c80 [21:02:32] ? [21:02:39] yeah, like that [21:02:49] yes ottomata [21:02:58] ok [21:03:00] just saved [21:03:11] restart restbase? [21:03:17] yup [21:03:38] kk, works now [21:03:43] so that fixes it [21:05:06] yes, thx mobrovac [21:05:16] i'll see if the tests need fixing [21:07:20] milimetric: i'm heading out for lunch, so if you have the pr ready in the meantime ping gwicke or urandom [21:07:30] k, thx [21:53:38] ottomata: can we merge? https://gerrit.wikimedia.org/r/#/c/268594/8 [21:56:01] yus! [21:56:22] ottomata: thanks sir [21:57:48] Analytics-Kanban, Reading-Admin, Patch-For-Review: Tabular layout on dashiki [8 pts] {lama} - https://phabricator.wikimedia.org/T118329#2009850 (Krinkle) * Time series (line graph?) for browser family (going back at least 3-5 months) * Time series (line graph?) for browser family + version (going back a...
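The request that exposed the breakage, and later confirmed the fix, makes a natural canary check: the port check passed while every pageview URL returned not-found. A minimal sketch, assuming the two base URLs quoted in the log; per_article_path and smoke_check are hypothetical helper names, and smoke_check is defined but not called here because it needs network access to the service.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def per_article_path(project, access, agent, article, granularity, start, end):
    """Build the per-article pageviews path in the shape seen in the log."""
    parts = [
        "metrics", "pageviews", "per-article",
        project, access, agent,
        quote(article, safe=""),  # article titles may contain '/' or spaces
        granularity, start, end,
    ]
    return "/" + "/".join(parts)


def smoke_check(base_url, path, timeout=10):
    """Return the HTTP status for base_url + path (200 is the healthy case)."""
    try:
        with urlopen(base_url + path, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 while the extra /metrics mount was in place


path = per_article_path(
    "en.wikipedia", "all-access", "all-agents", "Selfie",
    "daily", "2015100100", "2015103000",
)
print(path)

# On a node, as in the curl above:
#   smoke_check("http://localhost:7232/analytics.wikimedia.org/v1", path)
# Through the public endpoint:
#   smoke_check("https://wikimedia.org/api/rest_v1", path)
```

Running a check like this against the canary before continuing with the other hosts would have caught the routing problem while only one node was affected.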
[21:58:06] Analytics-Kanban: Dashiki visualization that shows a hierarchy [13 pts] {lama} - https://phabricator.wikimedia.org/T124296#2009857 (Krinkle) [21:58:08] Analytics-Kanban, Reading-Admin, Patch-For-Review: Tabular layout on dashiki [8 pts] {lama} - https://phabricator.wikimedia.org/T118329#2009856 (Krinkle) [21:59:23] hmm, nuria, change was [21:59:24] +intervals= 10 [21:59:50] ottomata: https://gerrit.wikimedia.org/r/#/c/268594/8/hieradata/role/common/analytics/burrow.yaml [21:59:55] mforns: milimetric: Added a comment summarising our meeting [21:59:58] thx :) [21:59:59] dunno why it wasn't 100 [22:00:04] but ja, i noticed you added space [22:00:09] should fix that [22:00:26] intervals= <% [22:00:28] vs before [22:00:32] intervals=10 [22:00:38] it'll probably work, but is inconsistent [22:01:03] nuria: maybe its because you don't have a space in the yaml [22:01:05] not sure [22:01:08] burrow::lagcheck_intervals:100 [22:01:40] ottomata: ok, will resubmit [22:05:50] ottomata: added space: https://gerrit.wikimedia.org/r/#/c/269304/1/hieradata/role/common/analytics/burrow.yaml [22:06:18] nuria: please remove extra space from burrow.cfg.erb too [22:06:29] https://gerrit.wikimedia.org/r/#/c/268594/9/modules/burrow/templates/burrow.cfg.erb [22:06:38] you put a space after the = [22:08:06] ottomata: done [22:08:48] Analytics-Tech-community-metrics, DevRel-February-2016: Data in korma project pages has confusing labels, is difficult to understand - https://phabricator.wikimedia.org/T110524#2009926 (Aklapper) Proposing to close this task somewhere between declined and resolved. Some items on the left have now mouse-o... 
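The guess above about the missing space in the hiera file lines up with a general YAML rule: a colon only separates a key from its value when whitespace follows it, so burrow::lagcheck_intervals:100 reads as one plain string rather than a key with the value 100. The sketch below is a deliberately tiny stand-in for that rule, not a YAML parser, and parse_mapping_line is a hypothetical helper. Whether this was the actual cause of the wrong value is not confirmed in the log.

```python
def parse_mapping_line(line):
    """Split a one-line 'key: value' entry the way the YAML colon rule implies.

    Returns (key, value) when a colon is followed by a space, (key, None) when
    the line ends with a colon, and None when no separator is present.
    """
    text = line.strip()
    if text.endswith(":"):
        return text[:-1], None
    key, separator, value = text.partition(": ")
    if not separator:
        return None  # no colon followed by a space: not a key/value pair
    return key, value.strip()


without_space = parse_mapping_line("burrow::lagcheck_intervals:100")
with_space = parse_mapping_line("burrow::lagcheck_intervals: 100")
print(without_space)
print(with_space)
```

The double colon inside the key is harmless in both cases, since neither of those colons is followed by a space.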
[22:12:33] nuria: that worked [22:12:34] +intervals=100 [22:12:40] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#2009935 (Aklapper) [22:12:42] Analytics-Tech-community-metrics, DevRel-February-2016: Mailing lists recently added to korma do not have "Top senders" data created (JSON file is 404) - https://phabricator.wikimedia.org/T123929#2009936 (Aklapper) [22:13:27] Analytics-Tech-community-metrics, DevRel-February-2016: Key performance indicator: Top contributors: Find good Ranking algorithm fix bugs on page - https://phabricator.wikimedia.org/T64221#660476 (Aklapper) [22:15:35] Analytics: Add pivot parameter to tabular layout graphs {crow} [? pts] - https://phabricator.wikimedia.org/T126279#2009951 (Milimetric) NEW a:Milimetric [22:23:32] ottomata: the pull request I submitted is gonna get merged soon, but we don't have to deploy it tonight [22:23:45] ok [22:23:46] thanks for the fix, it appears stable [22:23:49] ja let's do tomorrow [22:23:51] yup [22:31:00] Analytics-Wikistats, Blocked-on-Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (Krinkle) NEW [22:31:20] Analytics-Wikistats, operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010007 (Krinkle) [22:32:21] milimetric: ottomata: merged the PR and updated the gerrit master [22:32:26] Analytics-Kanban: Improve the data format of the browser report {lama} - https://phabricator.wikimedia.org/T126282#2010009 (mforns) NEW a:mforns [22:34:13] thx mobrovac, appreciated, and thx for the fix [22:34:28] np [22:34:28] it probably would've taken me a week [22:34:45] :) [23:37:23] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Refactor analytics/cdh roles to use
hiera, setup Analytics Cluster in beta labs. [21 pts] - https://phabricator.wikimedia.org/T109859#2010163 (Ottomata) [23:45:17] Analytics: Provide API for sampling pageviews - https://phabricator.wikimedia.org/T126290#2010185 (Tgr) NEW