[04:24:07] Analytics, MediaWiki-Vagrant: confluent-kafka (dependency of eventlogging) fails to build in mediawiki-vagrant - https://phabricator.wikimedia.org/T140447#2465102 (dpatrick) [04:29:33] Analytics, MediaWiki-Vagrant, Patch-For-Review: confluent-kafka (dependency of eventlogging) fails to build in mediawiki-vagrant - https://phabricator.wikimedia.org/T140447#2465119 (bd808) Open>Resolved a:dpatrick [08:44:11] Analytics-Kanban, Operations, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465314 (elukey) Ran again the query, no empty dt fields for the past hours too. The issue seems solved! We'll might need to tune a... [08:44:19] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465315 (elukey) [10:53:36] (PS1) Addshore: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) [11:03:06] * elukey lunch! [11:09:42] buon appetito [11:41:17] lol [11:41:19] :) [11:41:28] grazie in ritardo mforns! [11:46:27] :] [11:51:53] (CR) WMDE-leszek: Count global users with beta features enabled (3 comments) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore) [12:00:18] morning [12:01:03] mforns: reporting for duty [12:01:40] milimetric, hey :] [12:02:20] what should we do? [12:02:28] https://hangouts.google.com/hangouts/_/wikimedia.org/eds-batcave ? [12:02:30] I don't know, what's there to do? [12:02:35] hehe [12:02:38] :P [12:03:01] ok, sure, let's do ehr-batcave though - edit history reconstruction [12:03:04] wait what's eds? [12:03:17] hehe, edit data sprint [12:03:21] whatever [12:03:37] https://hangouts.google.com/hangouts/_/wikimedia.org/ehr-batcave [12:50:49] (CR) WMDE-leszek: Count global users with beta features enabled (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore) [12:55:10] (CR) WMDE-leszek: [C: -1] Count global users with beta features enabled (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore) [13:07:54] (PS2) Addshore: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) [13:12:06] mobrovac: o/ time for a scap3 question? [13:15:10] (PS3) Addshore: Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) [13:25:12] hmmm elukey you have had more eyes on GC metrics than I have [13:25:26] i'm looking at the your jmxtrans patch, trying to understand how the other GC things work [13:25:43] why do I reuse resultAlias a few times for different metrics? [13:26:06] ottomata: o/ [13:26:15] that one is my only doubt [13:26:35] I supposed that since only a couple of them can be active at the same time maybe it is fine to repeat? [13:26:41] i was about to advise to use name=* and typeNames => ['name'] [13:26:42] aye [13:26:47] i was gonna say that [13:26:51] but there are already a few there? [13:27:22] no idea, I followed what I found in there.. is it that important the resultAlias? [13:27:54] ja it changes the metric name sent [13:27:58] no? 
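A note on the jmxtrans exchange above: the suggestion is to replace several per-collector queries that reuse the same resultAlias with one wildcard query, letting typeNames keep each collector's name in the metric path. Below is a minimal sketch of that query shape, written as the same kind of Ruby-style hash pasted later in this log; the attribute list and the exact key the puppet module expects for attributes are assumptions, not the reviewed patch.

    # One query instead of one per collector. The object name is the standard
    # JMX name for the garbage collector MXBeans; name=* matches all of them
    # (ParNew, ConcurrentMarkSweep, G1 Young Generation, ...).
    # "typeNames" => ["name"] makes jmxtrans append each collector's name to
    # the metric path, so a single result_alias no longer collides.
    gc_query = {
      "name"         => "java.lang:type=GarbageCollector,name=*",
      "typeNames"    => ["name"],
      "result_alias" => "GarbageCollector",
      # CollectionCount and CollectionTime are standard MXBean attributes;
      # the key name used here ("attrs") should be checked against the module.
      "attrs"        => ["CollectionCount", "CollectionTime"],
    }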
[13:28:51] ah snap [13:28:55] didn't know it [13:30:14] elukey: i have not looked at this in soooo long! [13:32:23] ottomata: whenever you have 5 minutes I'd need to brainstorm about scap and refinery [13:32:39] I think I got it but there are some unclear points [13:32:43] ok cool [13:32:48] gimme a few, am looking at jmxtrans stuff [13:32:54] will leave a comment there [13:33:02] i think we can clean up some crap i did too :p [13:36:23] elukey: i can brainstorm via IRC now, or batcave in like 30 mins :) [13:37:00] sure! I can write something and then you'll answer when you have time [13:37:04] ok? [13:38:01] ja! [13:38:19] i have time now, i'm just ina room with some other people atm so can't jump in batcave [13:40:19] sooo I've read https://doc.wikimedia.org/mw-tools-scap/scap3/ssh-access.html and I merged it with https://wikitech.wikimedia.org/wiki/Scap3/Migration_Guide [13:40:56] so I'd say that we could create a pair of pub/priv ssh keys for deploy-analytics [13:41:10] upload the priv to the private repo under the proper folder [13:41:19] add the config in scap server.yaml [13:41:45] and then in scap::target configure the appropriate user etc.. [13:42:18] but (maybe I am wrong) I didn't see any mention to how to add the ssh pub key to the target host [13:42:28] in this case analytics1027 [13:42:34] to allow tin to connect [13:42:42] (and I didn't see any config for eventlogging) [13:42:45] am I crazy? [13:43:51] hmmm sounds about right [13:43:54] adding pub key.. hmmm [13:44:03] i thought scap::target does that, looking [13:44:22] ja elukey scap::target does it [13:44:25] via $manage_user [13:44:50] i think that might not be used for eventlogging [13:44:58] ja [13:45:01] for eventlogging it is in [13:45:06] eventlogging::deployment::target [13:45:10] since the eventlogging user already existed before scap [13:45:17] and was used for non scap things [13:45:37] checking! [13:46:01] elukey: there is also the matter of re-arming the keyholder with the new key [13:46:07] yep yep [13:46:15] https://wikitech.wikimedia.org/wiki/Keyholder [13:46:16] but it should be a one time thing right? [13:46:17] aye cool [13:46:19] just so you know [13:46:19] yup! [13:46:22] super [14:06:30] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465959 (Ottomata) NICE WORK! [14:07:43] \o/ [14:54:14] ottomata: about the ssh pub key - content => secret("keyholder/${key_name_safe}.pub"), [14:54:31] do I need to add it somewhere, or will I just need to add it once manually to the keyholder? [14:55:13] hmm, that i'm not sure ummm [14:55:38] maybe secret module in private? 
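To make the scap3 plan above concrete, here is a rough sketch of the target side. scap::target and manage_user are the pieces named in the discussion; the resource title, the user name, and the key_name parameter spelling are assumptions (key_name is inferred from the key_name_safe lookup quoted above), so treat this as a shape to check against the module rather than the config that was merged.

    # Sketch only: a scap3 target for the refinery deploy on analytics1027.
    scap::target { 'analytics/refinery':
        deploy_user => 'analytics-deploy',   # assumed user name, not settled in the log
        key_name    => 'deploy-analytics',   # the key pair being created above
        manage_user => true,                 # per the discussion, this is what creates
                                             # the user and installs the public key
    }

The key pair itself lives in the private repo under modules/secret/secrets/keyholder/ (that is what the secret("keyholder/${key_name_safe}.pub") lookup reads), and the private half has to be armed once in keyholder on the deployment server, as noted above. Eventlogging is the exception: it wires this up in eventlogging::deployment::target because its user predates scap.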
[14:55:57] yup [14:55:58] elukey: [14:56:03] private/modules/secret/secrets [14:56:10] /keyholder [14:56:11] i thikn [14:56:15] yup [14:57:32] Analytics-Cluster, Operations, Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2466093 (fgiunchedi) p:Triage>Low [14:57:44] Analytics-Cluster, Analytics-Kanban, EventBus, Operations, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466094 (Ottomata) Woot, preliminary dash here: https://grafana.wikimedia.org/dashboard/db/zookeeper [14:58:35] ottomata: ah snap I put the private under /private/modules/secret/secrets/ssh/tin [14:58:38] grrr [14:58:44] I am going to move it and add the pub [15:00:55] ottomata: no change in log sizes, i guess we can restart namenode but will try -again- to find more info about this [15:01:16] hah yeah [15:01:17] growl [15:01:27] well, i mean, we did restart RMs near end of day yesterday [15:01:29] so hasn't been 24 hours [15:01:33] but ja, you are probably right [15:05:58] elukey: sorry, was out for a bit [15:06:02] elukey: what's up? [15:06:41] mobrovac: already resolved thanks :) [15:06:47] ottomata: keys added! [15:06:59] ETOOMUCHTRIBALKNOWLEDGE [15:07:39] ottomata: also i think time is relative to the time the application took to run, so it might not apply retroactively [15:09:06] (CR) WMDE-leszek: [C: 1] Count global users with beta features enabled [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/299125 (https://phabricator.wikimedia.org/T140229) (owner: Addshore) [15:09:39] Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 3 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2466150 (elukey) Just generated and added the pub/priv key to the private repo. [15:13:12] hmm, interesting possibly nuria_ doesn't seem quite right but could be. [15:13:36] ottomata: i am going to spend an hour looking for docs see if i find something [15:13:53] k [15:14:10] nuria_: you could probably read code for it and figure out how it does it and see if it looks at really logs too [15:14:30] ottomata: ya, looked at deletion service in source forge [15:14:36] Analytics-Cluster, Analytics-Kanban, EventBus, Operations, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466180 (mobrovac) Nice! [15:14:46] ottomata: but i could not see anything related to thta setting [15:14:48] *that [15:26:08] ottomata: ok, the thing that deletes these logs lives inside JobHistoryServer [15:27:35] nuria_: HMM, you know...which we don't explicitly run... [15:27:42] this is something i have been a little confused about [15:28:04] no wonder [15:28:06] oh wait [15:28:09] YES we do [15:28:15] man i am stale on this stuff [15:28:17] its called hadoop-mapreduce-historyserver [15:28:19] and we run it [15:28:21] of course we do. [15:28:38] this makes more sense, i had some notion that it had moved inside of yarn rm or something [15:28:55] ok, let me look at the code but that would be the service taht needs a restart (i will document this once we .. ahem... know what is going on) [15:28:58] nuria_: this can just be restarted, will do so now. [15:29:04] ottomata: k [15:29:05] ok, want me to wait? [15:29:28] nuria_: ? [15:29:38] ottomata: no no, please restart [15:29:43] k [15:29:45] !log restarting hadoop-mapreduce-historyserver to apply yarn log aggreation retention settings [15:30:22] nuria_: OH [15:30:25] will you look at that... 
[15:30:26] Deleting aggregated logs in hdfs://analytics-hadoop/var/log/hadoop-yarn/apps/addshore/logs/application_1441303822549_25187 [15:30:27] haha [15:30:33] ottomata: ta-ta-channnnn [15:30:34] lots of those in logs [15:30:42] nice nuria_ sorry for that confusion, i shoulda known that [15:30:42] ottomata: does it look good? https://puppet-compiler.wmflabs.org/3357/ [15:31:38] {"name"=>"java.lang:name=* "typeNames"=>["name"] "result_alias"=>"GarbageCollector" [15:31:40] looks good to me! [15:31:40] urandom: o/ I'd need you for a couple of questions [15:31:56] about life, univers and all that [15:31:57] elukey: you can test in labs if you want, but this won't hurt anything to just apply in prod and fix if it isn't quite right [15:32:04] up to you [15:32:15] might be faster easier to just do in prod [15:32:51] mmmmmm [15:33:03] I'll probably grab some time to test it on Monday [15:33:11] it seems not a good idea on Friday [15:33:23] especially on a EU evening :P [15:33:27] Analytics-Cluster, Analytics-Kanban, EventBus, Operations, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466240 (Ottomata) Alerts too! https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=conf1001&service=Zookeeper+Alive+Client+Connecti... [15:33:53] elukey: yeah, but we don't have alerts or antyhing based on it [15:33:55] and it is just monitoring [15:33:57] but jaaa [15:34:01] i guess itis your evening [15:34:04] elukey: when the time comes, choose the one on the left. [15:34:04] and i don't want to babysit it :p [15:34:12] elukey: trust me. [15:34:50] ottomata: now documented here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Logs [15:36:08] urandom: :D [15:36:48] urandom: so I was reviewing https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings today and one thing that I wondered (popped up earlier also discussing with Joal) is the space requirement for each cassandra instance [15:37:09] we need ~6TB for each cassandra instance to have 3 years of retention [15:37:40] this is because we think that one year of data will be ~2TB on each instance.. [15:37:48] but this is true for the _actual_ cluster [15:37:51] not the new one [15:37:57] so I am wondering.. going from 3 to 6 [15:38:05] (and changing compaction) [15:38:23] should change also the space consumption right? (I suppose reducing it) [15:38:41] 2T compressed, yes? [15:38:44] because from my calculations with 2 disks failure we are sort of in a very bad position [15:39:58] elukey: i might be confused, what do you mean going from 3 to 6? [15:40:13] TB? [15:40:44] nope instances sorry [15:40:44] no, that's not right... [15:40:46] or nodes [15:41:03] ottomata: ok, i see 3TB of logs got deleted [15:41:22] elukey: oh, i see [15:41:41] so now I am confused about the compression [15:41:48] I know that the sstables file are compressed [15:41:56] nuria_: its still going too! [15:41:58] if you want to watch [15:42:01] on analytics1001 [15:42:06] tail -f /var/log/hadoop-mapreduce/mapred-mapred-historyserver-analytics1001.out [15:42:09] but nodetool status reports 2.2TB meanwhile df -h 2.6TB [15:42:17] ottomata: k, good, will check in a bit. [15:42:19] but I guess it is fine [15:42:34] elukey: which node? [15:42:35] urandom: I am trying to understand if raid10 would be out of the question or not [15:42:38] aqs1001 [15:42:43] ja good for documenting nuria, pretty weird that it is a yarn-site setting for a 'mapreduce' service. 
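For reference, these are the yarn-site properties behind the retention behaviour being discussed, shown as a plain hash of property names with example values rather than the actual cluster config. The property names are standard YARN settings; the deletion service that honours them runs inside the JobHistoryServer, which is why the hadoop-mapreduce-historyserver restart above is what finally triggered the cleanup.

    # Example values only, not what analytics-hadoop runs.
    yarn_log_aggregation_settings = {
      "yarn.log-aggregation-enable"                        => true,
      # How long aggregated application logs are kept in HDFS under
      # /var/log/hadoop-yarn/apps/<user>/logs/.
      "yarn.log-aggregation.retain-seconds"                => 30 * 24 * 3600,  # 30 days
      # How often the deletion service looks for expired logs.
      "yarn.log-aggregation.retain-check-interval-seconds" => 24 * 3600,
    }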
i bet that is some legacy thing [15:43:54] nuria_: https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=25&fullscreen :) [15:44:19] elukey: so w/ aqs100[1-3], you figured 2T per host of utilization, per each year of data, is that right? [15:44:58] urandom: this is what I got, but I could be wrong [15:45:06] where in aqs100[1-3], host and instance can be used interchangeably [15:45:16] but calculations fit, since we have raid0 arrays of ~6TB [15:45:30] urandom: yeah [15:46:10] so yeah, i would expect that w/ aqs100[4-6], that would work out to roughly 1T per instance [15:46:38] the same quantity of data [15:47:44] also, nodetool status, that value "load" is supposed to be the total of SSTable files on disk [15:47:49] so "compressed" [15:47:57] okok.. now that I recall during some tests joal was puzzled about the size of the sstables on aqs100[456] after some load job.. [15:48:04] okok [15:48:05] why it reports differently than df, i have no clue [15:48:12] ok so I am not crazy [15:49:21] anyway, i'm may have lost track of your concerns playing catch up here, could you summarize them again? [15:49:29] yes sure [15:49:48] so we chose to have a 4 disks raid0 array for each cassandra instance [15:50:13] yup [15:50:22] and if I made my assumptions correctly in https://wikitech.wikimedia.org/wiki/User:Elukey/Ops/AQS_Settings the new cluster will tollerate only 2 disks failures [15:50:45] by disks, you mean raid arrays? [15:50:47] even worse, each time that a disk fail a RAID0 array will need to get rebuilt [15:51:06] and i guess you mean simultaneous failures? [15:51:13] or in a rapid sequence [15:51:24] you know, very bad luck for analytics week or something similar :) [15:51:42] ops asked to us to motivate really well why we put raid0 in prod [15:51:51] since disk failures should not become emergencies [15:52:05] I mean, a couple [15:52:24] with RAID10 we'd have more resiliency [15:52:40] but it was discarded long time ago because we needed space [15:52:52] elukey: you know, i keep saying this (a lot), but we seem to get drives of the same model from the same source [15:53:06] i have to assume they are from the same lot [15:53:27] and so having disks fail in rapid succession is actually not that surprising [15:53:40] it would not surprise me at all [15:54:04] this has happened to me more than once [15:54:11] we should really be mixing things up a bit [15:54:22] yeah... [15:54:33] so from https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system and IIRC, joal loaded one month of data [15:54:40] that is ~200GB [15:54:55] per instance? [15:54:59] yes [15:55:06] so assuming a linear increase [15:55:14] that's a lot more than 1T [15:55:21] ah snap [15:55:41] do you mean for a year? [15:55:46] yeah [15:55:51] that's 2400G [15:56:45] may I ask what is the calculation that you made? [15:56:53] 200 * 12 [15:57:09] sure that is not 1.2 TB as my brain was suggesting [15:57:14] sigh friday brainfarts [15:57:33] it's friday morning for me and i am already struggling, so no worries :) [15:57:47] could this be due to the levelled compaction? [15:58:00] i don't think so [15:58:01] or it is completely not relevent? [15:58:03] okok [15:58:31] i mean, i guess that compaction could effect compression efficiency [15:58:42] but i wouldn't expect by a lot [15:58:48] * urandom is thinking out loud [15:59:20] but this means that doubling the instances does not reduce sstable size? 
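On the RAID question raised above, the trade-off can be put in numbers. This assumes four disks of roughly 1.5T behind each Cassandra instance, which is what the ~6T RAID0 arrays mentioned earlier imply; a back-of-the-envelope sketch, not a recommendation.

    disks_per_instance = 4
    disk_size_tb       = 1.5   # assumption, to match the ~6T arrays above

    raid0_usable_tb  = disks_per_instance * disk_size_tb        # 6.0 TB
    raid10_usable_tb = disks_per_instance * disk_size_tb / 2.0  # 3.0 TB

    # RAID0: any single disk failure destroys the array, so the Cassandra
    # instance is lost and has to be rebuilt and re-bootstrapped from replicas.
    # RAID10: a single disk failure only degrades the array; the instance
    # stays up unless the other disk of the same mirror pair also fails.
    puts "RAID0: #{raid0_usable_tb} TB usable, RAID10: #{raid10_usable_tb} TB usable"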
[15:59:26] it is a bit weird [15:59:48] not sure what you mean [15:59:59] the largest SSTable will be smaller when you have more instances [16:00:07] but the total amount of data remains unchanged [16:00:39] yeah but the keyspace should be split among more instances, so I'd expect less data per node in the sstables.. no? [16:01:01] elukey: right now we have 2 TB per node with half a year [16:01:26] that would fit the match [16:01:30] s/match/math/ [16:01:54] elukey: standdup? [16:01:55] nuria_: we have HALF a year? [16:02:12] 4T per year per instance @ 3 instances, would be 2T per year per instance @ 6 instances [16:02:13] * elukey cries in a corner [16:02:34] yes now it makes sense [16:02:44] I was convinced about a full year stored [16:02:52] sorry urandom, now I feel better [16:03:53] elukey: also, remember you need working space for range moves and compaction, so you don't want to run full utilized [16:04:02] s/full/fully/ [16:05:21] elukey: NO SORRY, 1 year [16:06:06] * elukey cries again [16:06:29] nuria is WRONG!!! [16:07:38] urandom, elukey : pageviews from july 2015: https://analytics.wikimedia.org/dashboards/vital-signs/#projects=eswiki,itwiki,enwiki,jawiki,dewiki,ruwiki,frwiki/metrics=Pageviews [16:10:00] Analytics-Kanban, Operations, Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2466298 (Nuria) Open>Resolved [16:28:43] urandom: but while our total data size is 2TB per year and that doesn't change, now we have 6 cassandra instances when before we had 3. [16:29:31] urandom: all things being equal we would expect 1TB of data per cassandra instance if we have teh same replication , correct? cc elukey (just trying to understand) [16:30:57] something is off, I'll review it with joal on Monday [16:31:08] hopefully he has just loaded 2 months of data [16:31:14] and calculations will be ok :) [16:33:24] nuria_: you guys have 3x replication, yes? [16:33:34] urandom: yes [16:37:51] nuria_: i dunno, i think i might be pretty confused at this point [16:38:14] grafana shows 200MB per instance for what is reported to be 1 month [16:38:24] for the new nodes [16:38:44] which is more than 2T for a year at 6 instances [16:38:56] what would be more than 4T under the current production cluster [16:39:08] does that not fit? [16:41:59] to restate: that would be 2T per-instance per-year in a 6 instance configuration, or 4T per-instance per-year under a 3 instance configuration [16:45:56] yeah but we have 2TB for a year of data per instance (as it seems) in a 3 instance config [16:46:40] anyhow, let's resync on Monday with joal, I am probably missing something [16:46:51] elukey: ok, i thought nuria_ was saying that was actually 6 months [16:46:52] don't want to waste everybody's time :) [16:47:09] yeah but then she corrected and said the we have one year [16:47:28] auh, ok, i missed that [16:47:33] whew! [16:47:35] :)L [16:47:38] :) [17:00:55] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Create basic/high-level Kibana (dashboard) documentation - https://phabricator.wikimedia.org/T132323#2466390 (Aklapper) [17:01:49] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2466391 (Aklapper) p:Low>Lowest [17:02:29] Analytics-Tech-community-metrics: Deployment of Gerrit (basic) panel - https://phabricator.wikimedia.org/T137999#2466393 (Aklapper) p:Triage>Normal [17:04:02] going afk team! 
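The sizing back-and-forth above boils down to two readings of the same numbers, so here is the arithmetic laid out side by side. It assumes evenly balanced token ownership with replication factor 3, so doubling the instance count from 3 to 6 roughly halves each instance's share; the log ends with the team deferring to a Monday resync rather than settling which reading is right.

    # Reading 1: aqs100[1-3] hold ~2 TB per instance for a full year of data.
    per_instance_year_3 = 2.0                          # TB per instance per year
    per_instance_year_6 = per_instance_year_3 * 3 / 6  # 1.0 TB expected per new instance

    # Reading 2: the freshly loaded month on the new instances is ~200 GB each.
    per_instance_year_from_load = 200 * 12 / 1000.0                     # 2.4 TB per instance per year (6 instances)
    implied_on_old_cluster      = per_instance_year_from_load * 6 / 3   # ~4.8 TB per instance per year (3 instances)

    # Reading 2 only matches the old cluster's ~2 TB per instance if that
    # 2 TB covers roughly half a year of data, which is exactly the point
    # of confusion in the conversation.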
byyyeeeeee [17:48:21] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Identify Wikimedia's most important/used info panels in korma.wmflabs.org - https://phabricator.wikimedia.org/T132421#2466540 (Aklapper) p:High>Normal [17:48:52] hey mforns sorry finishing up phone call so I'll be a little longer [17:49:01] milimetric, np [17:49:19] Analytics-Tech-community-metrics, Phabricator: Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#2466546 (Aklapper) [17:50:29] Analytics-Tech-community-metrics: Maniphest support to be added to GrimorieLab - https://phabricator.wikimedia.org/T138003#2466551 (Aklapper) [17:50:31] Analytics-Tech-community-metrics, Phabricator: Decide on wanted metrics for Maniphest in kibana - https://phabricator.wikimedia.org/T28#2466550 (Aklapper) [18:02:43] sorry mforns, ok ready, to the ehr-batcave! [18:02:54] milimetric, :] there [20:08:55] Analytics-Tech-community-metrics, Developer-Relations (Jul-Sep-2016): Create basic/high-level Kibana (dashboard) documentation - https://phabricator.wikimedia.org/T132323#2467108 (Aklapper) Well, what I currently know and understand is on https://www.mediawiki.org/wiki/Community_metrics#dashboard.bitergi... [22:21:56] nuria_: congrats! [22:58:07] Does anyone know if pageviews are accurate for special pages? The numbers for Special:RecentChanges on enwiki show a big drop at the end of May, which seems odd. [22:58:09] https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-60&pages=Special:RecentChanges [23:01:39] It could just be a bot losing interest or something like that, though.
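On the closing question about Special:RecentChanges: the pageviews tool reads from the public Wikimedia REST API, so the drop can be checked directly against the per-article endpoint, for example by pulling the daily series around the end of May and comparing access methods or agent types. A small sketch follows; the endpoint path follows the documented wikimedia.org/api/rest_v1 format and the date range is just an example.

    require 'net/http'
    require 'json'
    require 'uri'

    # Daily pageviews (agent=user) for Special:RecentChanges on enwiki,
    # May through June 2016. Timestamps are YYYYMMDDHH, with 00 as the hour
    # for daily granularity.
    article = URI.encode_www_form_component("Special:RecentChanges")
    url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/" \
          "en.wikipedia/all-access/user/#{article}/daily/2016050100/2016063000"

    body = Net::HTTP.get(URI(url))
    JSON.parse(body).fetch("items", []).each do |item|
      puts "#{item['timestamp']} #{item['views']}"
    end

Re-running the same query with all-agents instead of user, or split by access method, usually shows quickly whether a drop like this is automated traffic disappearing or being reclassified rather than a change in human readership.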