[00:04:57] <gwicke>	 anybody here around who knows how to restart or even deploy the eventbus rest service?
[00:05:51] <gwicke>	 https://phabricator.wikimedia.org/T140848
[02:02:14] <milimetric>	 I took a look and what Otto said via text sounds right to me, but no, I don't have any rights to eventbus, which looks like it's deployed to eventbus.codfw.wmnet: https://github.com/wikimedia/operations-puppet/blob/c41fcb19ba030a850ab929bf77a6614e10df4d36/hieradata/role/codfw/eventbus/eventbus.yaml
[02:32:18] <ori>	 gwicke: what do you need? kill -SIGHUP of eventbus?
[06:16:42] <elukey>	 mobrovac: let me know if you need help with T140848
[06:16:42] <stashbot>	 T140848: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted." - https://phabricator.wikimedia.org/T140848
[07:13:56] <mobrovac>	 elukey: yes, please
[07:14:11] <mobrovac>	 i do need it as i don't have access to the nodes
[07:16:12] <elukey>	 mobrovac: same as yesterday?
[07:16:42] <mobrovac>	 nope, let's try something else this time
[07:17:04] <mobrovac>	 elukey: can you check the git sha1 of the event-schemas repo on kafka1001?
[07:17:23] <elukey>	 sure
[07:20:45] <elukey>	 git log on /srv/event-schemas gives me 4db9d40d28d61c53cdbca77059d9a2a6e714af89
[07:21:21] <mobrovac>	 k, lemme check it
[07:22:06] <mobrovac>	 elukey: ok, so the sha1 is the good one
[07:22:30] <mobrovac>	 which may mean only one thing - yesterday's sighup didn't work as expected
[07:22:52] <mobrovac>	 this is something to be investigated, but for the time being, please restart the service
[07:22:59] <elukey>	 sure
[07:23:20] <elukey>	 I never understood whether we'd need to depool/pool via conftool for a service restart
[07:23:30] <elukey>	 I guess no but I am not super familiar with eventbus
[07:23:57] <mobrovac>	 theoretically, we should, but for the time being we are losing events as is, so let's do it quick and dirty
[07:24:59] <mobrovac>	 also, restart it on kafka1002 too
[07:26:43] <elukey>	 done on kafka1001
[07:27:36] <elukey>	 but I can see (MainThread) 500 POST /v1/events (10.64.0.113) 43.41m while restarting (now it is good) so for 1002 I am going to depool first
[07:28:29] <mobrovac>	 kk
[07:31:20] <elukey>	 mobrovac: done
[07:31:43] <mobrovac>	 kk, i'll try to test to the best of my abilities without access and report back
[07:32:33] <elukey>	 sure, let me know if I can help, trying to understand the impact
[07:34:07] <mobrovac>	 the impact is rather low at this point - we are losing page edit events from mediawikiwiki and test2wiki
[07:34:25] <elukey>	 ahh okok
[07:34:51] <elukey>	 so new schemas are not loaded correctly creating impact where they are needed, since the new mw version counts on them
[07:36:07] <mobrovac>	 yup, exactly
[07:40:03] <mobrovac>	 HTTP/1.1 201 All 1 events were accepted.
[07:40:07] <mobrovac>	 \o/
[07:43:28] <elukey>	 \o/
[07:43:34] <elukey>	 so a sighup is not enough?
[07:58:11] <mobrovac>	 apparently not
[08:05:44] <jand_wmde>	 hey, do you know which data acquisition methods are mainly used by the embedded researchers of the product teams? I understand that the web request log needs quite some ahead-of-schedule work (handing in a research plan, getting permission…) and I wonder what their alternatives are?
[08:22:55] <elukey>	 joal: oozie this morning is not in the right mood - cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2016-7-19
[08:23:54] <elukey>	 jand_wmde: I am not a super expert but it could be easier for us if you explain your use case
[08:24:15] <elukey>	 we could give you better feedback
[08:24:16] <elukey>	 :)
[08:24:35] <elukey>	 (probably not me but other people in the chan for sure)
[08:33:28] <joal>	 elukey: o/
[08:33:36] <jand_wmde>	 elukey: sure. My current case: I'm interested in how people use watchlists, since it is a wish of the German Community to improve their functionality. There are some suggestions like watchlist item expiry or multiple watchlists, however we are not sure about the underlying workflows, needs etc.
[08:33:36] <jand_wmde>	 So I'd be interested e.g.: how long items stay on watchlists, how many items people have on there (distribution), which kind of pages are watched etc. (So it would be clearly exploratory) – but I'm unsure how.
[08:33:53] <joal>	 jand_wmde: hi :)
[08:34:10] <jand_wmde>	 joal: hej :)
[08:34:21] <joal>	 jand_wmde: My guess would be to use eventlogging
[08:35:04] <joal>	 jand_wmde: webrequest probably doesn't have the data you're after, and it also seems relatively small (compared to traffic for instance), so eventlogging could make sense
[08:35:30] <joal>	 elukey: The daily cassandra loading job has been failing for a few days ...
[08:35:42] <joal>	 elukey: I need to investigate that
[08:36:06] <jand_wmde>	 joal: thanks!
[08:36:34] <wikibugs>	 Analytics-Kanban: Continue New AQS Loading - https://phabricator.wikimedia.org/T140866#2479035 (JAllemandou)
[08:36:39] <joal>	 np jand_wmde
[08:37:01] <joal>	 also jand_wmde, it's probably a good idea to double check what I said with milimetric :)
[08:37:18] <joal>	 milimetric knows EVERYTHING ;)
[08:40:17] <elukey>	 joal: o/
[08:42:29] <wikibugs>	 Analytics-Kanban: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869#2479077 (JAllemandou)
[08:42:53] <wikibugs>	 Analytics-Kanban: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869#2479077 (JAllemandou)
[08:43:04] <wikibugs>	 Analytics-Kanban: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869#2479077 (JAllemandou)
[08:43:42] <joal>	 elukey: I found a so-called "magic" param for cassandra compaction
[08:44:07] <elukey>	 --compact-all-the-things ?
[08:44:25] <joal>	 elukey: I'd love that one so much ... Unfortunately, not magic at that level :)
[08:44:33] <elukey>	 awwww
[08:44:35] <elukey>	 :D
[08:44:48] <joal>	 https://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__compaction_throughput_mb_per_sec
[08:44:49] <elukey>	 sorry jokes aside, what did you discover?
[08:45:04] <joal>	 compaction_throughput_mb_per_sec
[08:45:15] <elukey>	 ah!
[08:45:24] <elukey>	 this one might be the reason why bulk loading is so slow?
[08:45:49] <joal>	 elukey: might be the reason why compaction is so slow globally
[08:46:01] <joal>	 elukey: On a prod cluster serving data, I understand there must be a limit
[08:46:22] <joal>	 elukey: in our case, we are only loading, so maybe raising the limit could help
[08:46:57] <joal>	 elukey: recommended value is 16 to 32 times the rate of write throughput (in MB/second)
[08:48:36] <joal>	 elukey: We have observed a write throughput of ~30MB/s on new aqs at load time, so we probably could raise the compaction_throughput_mb_per_sec to ~600
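The sizing rule joal quotes (compaction throughput at 16 to 32 times the write throughput) can be checked with a little arithmetic. This is only a sketch: the function name `compaction_throughput_range` is made up for illustration and is not part of any Cassandra tooling, and the 30 MB/s input is the rough figure mentioned in the chat.

```python
def compaction_throughput_range(write_mb_per_sec, low_factor=16, high_factor=32):
    """Return the (low, high) compaction throughput implied by the 16x-32x rule.

    Hypothetical helper for illustration only; the factors come from the
    recommendation quoted in the chat, the input is in MB/s.
    """
    return write_mb_per_sec * low_factor, write_mb_per_sec * high_factor


# ~30 MB/s is the write throughput observed at load time, per the chat.
low, high = compaction_throughput_range(30)
print(low, high)
```

The ~600 MB/s that joal suggests sits inside that range, whereas the configured value of 60 is well below it.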
[08:50:08] <elukey>	 yes
[08:52:48] <joal>	 elukey: Do you mind checking if this parameter is already set in our config?
[08:53:13] <elukey>	 joal: I am doing it now :)
[08:53:23] <joal>	 elukey: You rock :)
[08:53:42] <elukey>	 in aqs.yaml (hieradata) we set cassandra::compaction_throughput_mb_per_sec: 60
[08:54:02] <elukey>	 more specifically
[08:54:03] <elukey>	 cassandra::compaction_throughput_mb_per_sec: 60
[08:54:03] <elukey>	 cassandra::concurrent_compactors: 10
[08:54:03] <elukey>	 cassandra::concurrent_writes: 18
[08:54:04] <elukey>	 cassandra::concurrent_reads: 18
[08:54:33] <elukey>	 it is really handy since I can change those parameters only for aqs100[456]
[08:55:42] <elukey>	 joal: --^
[08:55:48] <joal>	 elukey: yop, just saw that
[08:56:03] <joal>	 elukey: batcave for a minute to discuss those values?
[08:56:22] <elukey>	 sure, grabbing my headphones
[08:56:24] <joal>	 sure
[09:14:52] <addshore>	 elukey: there was still no otto around that other puppet patch I linked you to yesterday :/ Is there any chance you could take another look at it? (as it's blocking a few things)
[09:16:23] <elukey>	 addshore: I pinged you yesterday, the pcc failure was the absence of wmde secrets fake data in the fake private repo
[09:16:29] <elukey>	 I fixed it and now it looks good
[09:16:35] <elukey>	 we can merge if you want
[09:16:41] <elukey>	 joal:
[09:16:46] <elukey>	 Processor Information Socket Designation: Proc 2 Type: Central Processor Family: Xeon Manufacturer: Intel(R) Corporation
[09:16:58] <addshore>	 elukey: that would be great! :)
[09:17:05] <elukey>	 joal: flags HTT (Multi-threading)
[09:17:16] * addshore has no idea how all the secrets things work, all I know is I gave them to otto ;)
[09:18:17] <elukey>	 yeah so we have a puppet private repo
[09:18:19] <joal>	 ottomata is a trustable man, he'll never reveal a secret :)
[09:18:31] <elukey>	 with hieradata etc..
[09:18:37] <addshore>	 and also a fake private repo type thing for jenkins elukey ? :P
[09:18:38] <elukey>	 but the puppet compiler works in beta
[09:18:47] <elukey>	 yes!
[09:18:54] <addshore>	 ahh okay :)
[09:19:56] <elukey>	 addshore: https://gerrit.wikimedia.org/r/#/admin/projects/labs/private
[09:20:55] <elukey>	 addshore: https://puppet-compiler.wmflabs.org/3401/stat1002.eqiad.wmnet/
[09:21:48] <addshore>	 looks good! :)
[09:22:02] <addshore>	 and also seeing what the private repo looks like is also super useful!
[09:28:03] <elukey>	 addshore: just ran puppet on stat1002
[09:28:07] <elukey>	 if you want to check
[09:37:51] <elukey>	 joal: https://gerrit.wikimedia.org/r/#/c/299956/
[09:38:31] <addshore>	 thanks elukey !
[09:39:16] <grrrit-wm>	 (CR) Addshore: [V: 2] Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298938 (owner: Addshore)
[09:39:27] <grrrit-wm>	 (CR) Addshore: [C: 2] Use new config vars [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/298938 (owner: Addshore)
[09:51:07] <elukey>	 joal: all merged!
[09:51:17] <elukey>	 will restart instances after the compactions
[09:51:38] <joal>	 awesome elukey :)
[09:54:12] <joal>	 elukey: I think we can restart them now, but I hear your preference for waiting for the compaction to finish :)
[09:56:36] <elukey>	 yes I would prefer to do one thing at a time, just to do a cleaner job..
[09:56:47] <joal>	 elukey: sounds good :)
[10:10:26] <joal>	 Hi addshore
[10:10:32] <addshore>	 Hey joal!
[10:11:02] <joal>	 addshore: I saw your discussion with nuria_ yesterday on hardcoding table name in request
[10:11:22] <joal>	 addshore: My preference goes for parameterized :)
[10:11:32] <addshore>	 yup, it is still parameterized now!
[10:11:46] <joal>	 addshore: I have seen that, just wanted to confirm :)
[10:11:47] <addshore>	 I guess 1 good reason for it, is you could create a dummy table with a bit of data to test the job against!
[10:12:02] <joal>	 addshore: you read my mind :)
[10:19:36] <joal>	 elukey: I don't know what's up with old aqs but the loading job consistently fails with timeouts
[10:29:43] <elukey>	 sigh
[10:29:56] <elukey>	 joal: going to commute, I'll check in ~15 mins :)
[10:30:03] <joal>	 np elukey
[11:05:01] * elukey back
[11:22:56] * elukey coffee and then AQS
[11:50:19] <elukey>	 joal: so we were saying, timeouts
[11:50:33] <joal>	 elukey: That's what I have observed: timeout for logging
[11:55:05] <elukey>	 joal: ops has observed some issues with 503s returned by AQS during the past days, and I have still to investigate.. I thought they were related to throttling but we should return 429
[11:55:10] <elukey>	 so might be related to that
[11:55:24] <joal>	 elukey: there might be a correlation
[11:55:38] <joal>	 elukey: Will you look into it first?
[11:55:53] <joal>	 elukey: If so, I assign the task to you ;)
[11:55:53] <elukey>	 com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
[11:56:02] <elukey>	 yes sure
[11:56:17] <wikibugs>	 Analytics-Kanban: Investigate why cassandra per-article-daily oozie jobs fail regularly - https://phabricator.wikimedia.org/T140869#2479627 (JAllemandou) a:JAllemandou>elukey
[12:04:43] <mforns>	 hi milimetric :]
[12:05:32] <mforns>	 reading notes
[12:09:53] <milimetric>	 hey mforns, joining
[12:20:07] <milimetric>	 jand_wmde: I have some good-ish news for you
[12:20:19] <milimetric>	 jand_wmde: according to https://upload.wikimedia.org/wikipedia/commons/f/f7/MediaWiki_1.24.1_database_schema.svg there is this table called watchlist so I looked into it
[12:20:36] <milimetric>	 jand_wmde: it has a record for every time anyone added anything to a watchlist
[12:20:56] <jand_wmde>	 milimetric: great!
[12:21:10] <milimetric>	 but they don't have timestamps before 2012-04
[12:21:33] <milimetric>	 but the good thing seems to be that this table is never cleaned out, even when a page is vandalized by being moved
[12:21:49] <jand_wmde>	 good.
[12:22:00] <elukey>	 milimetric: sorry to bother you, I'd need an info if you know.. for analytics-$random-group access requests we'd need both managers (direct supervisor and Nuria) to sign-off right?
[12:22:11] <milimetric>	 jand_wmde: oh man, never mind, it's worse than that
[12:22:19] <milimetric>	 the timestamp field in there is super confusing: https://www.mediawiki.org/wiki/Manual:Watchlist_table
[12:22:46] <milimetric>	 jand_wmde: but that's the best historical data you'll get, as none of this is dumped.  You could restore old backups to get more, maybe, but that's a pain
[12:23:16] <jand_wmde>	 milimetric: should be ok, dont need to go very far back
[12:23:29] <milimetric>	 elukey: not sure what the process was in the past, but instead of nuria we could get anyone on our team's approval I guess, as long as they were confident approving
[12:24:00] <milimetric>	 jand_wmde: right, but I'm saying it's not clear which record happened when, a null timestamp could be a brand new watchlist addition
[12:24:34] <jand_wmde>	 milimetric: ah, ok. Can I see when something gets thrown out again?
[12:24:51] <elukey>	 milimetric: yeah makes sense
[12:24:51] <milimetric>	 jand_wmde: thrown out of the watchlist?  It doesn't as far as I can tell
[12:25:06] <jand_wmde>	 milimetric: ok
[12:25:08] <milimetric>	 jand_wmde: because I found watchlist entries for pages that have been deleted since the entry was added
[12:25:30] <jand_wmde>	 milimetric: ok, still useful. How can I access this data?
[12:25:42] <mforns>	 milimetric, oh didn't hear the ping
[12:25:43] <milimetric>	 but you could join through page to the revision table and get the page creation timestamp (the rev_timestamp in the revision record with rev_parent_id = 0)
[12:25:58] <milimetric>	 jand_wmde: on stat1003.eqiad.wmnet
[12:26:03] <milimetric>	 hey mforns - I'm in the cave
[12:26:05] <milimetric>	 (ours)
[12:26:18] <milimetric>	 jand_wmde: and with that, I gotta go, check out https://wikitech.wikimedia.org/wiki/Analytics/Data_access
[12:26:29] <jand_wmde>	 milimetric: thanks!
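The join milimetric describes (watchlist through page to revision, taking the `rev_timestamp` of the revision with `rev_parent_id = 0` as the page creation time) might look roughly like the sketch below. It is an assumption-laden illustration: it uses an in-memory SQLite database instead of the real replicas on stat1003, the tables contain only the columns the join needs, and the sample rows are invented.

```python
import sqlite3

# In-memory stand-in for the MediaWiki tables; only the columns used by the
# join are created, and every row below is invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watchlist (wl_user INTEGER, wl_namespace INTEGER, wl_title TEXT);
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER,
                       rev_parent_id INTEGER, rev_timestamp TEXT);
INSERT INTO watchlist VALUES (1, 0, 'Example');
INSERT INTO watchlist VALUES (1, 0, 'Deleted_page');
INSERT INTO page VALUES (10, 0, 'Example');
INSERT INTO revision VALUES (100, 10, 0, '20150101000000');
INSERT INTO revision VALUES (101, 10, 100, '20160101000000');
""")

# watchlist -> page (by namespace and title) -> first revision of the page.
QUERY = """
SELECT wl_user, wl_title, rev_timestamp AS page_created
FROM watchlist
JOIN page ON page_namespace = wl_namespace AND page_title = wl_title
JOIN revision ON rev_page = page_id AND rev_parent_id = 0
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Because these are inner joins, watchlist entries for pages that have since been deleted (which milimetric mentions finding) drop out of the result; a LEFT JOIN would keep them with an empty creation timestamp.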
[12:26:33] <mforns>	 milimetric, I need 15 minutes, is it ok with you?
[12:26:39] <milimetric>	 np
[12:26:45] <mforns>	 ok, be back shortly
[12:26:51] <mforns>	 I added a comment in the etherpad
[12:27:37] * milimetric curls up by the fireplace with the etherpad
[12:30:47] <addshore>	 elukey: what manager? ;)
[12:33:48] <addshore>	 per https://wikitech.wikimedia.org/wiki/Production_shell_access I see "Your manager's approval is usually not required, as you've already been granted access to the cluster; the project lead of the cluster you request access to should sign off (if in doubt, ask the Ops Clinic Duty person for the week.)"
[12:35:20] <elukey>	 addshore: probably you're right, a bit confusing
[12:36:58] <elukey>	 mmmmm
[12:37:10] <elukey>	 addshore: sorry to ask but you don't have a manager?
[12:37:35] <addshore>	 elukey: I just responded at https://phabricator.wikimedia.org/T140342#2479786
[12:37:43] <addshore>	 I do have a manager, but they are not WMF ;)
[12:38:07] <elukey>	 ahhh the joy of multiple foundations
[12:38:34] <elukey>	 ok so I'll ask nuria_ to check later on, but after that we should be ok
[12:39:00] <addshore>	 okay! And for other access requests Dan Garry has generally been our point of contact / approval person :)
[12:40:09] <elukey>	 sure sure! I am not super expert with clinic duties and it is confusing sometimes :)
[12:41:07] * addshore agrees ;)
[12:49:01] <mforns>	 milimetric, back
[12:49:04] <milimetric>	 hey
[12:49:07] <milimetric>	 I'm in the cave
[14:37:00] <mforns>	 milimetric, hangouts kicked me out and now says I'm "not allowed to join this videocall"
[14:37:06] <mforns>	 can we use https://hangouts.google.com/hangouts/_/wikimedia.org/ehr-batcave-2 ?
[14:38:03] <milimetric>	 sure, On my way!
[14:40:34] <nuria_>	 hola
[14:46:53] <grrrit-wm>	 (CR) Nuria: [C: 2 V: 2] Match title & ns using x_analytics header & get all agent_types [analytics/refinery/source] - https://gerrit.wikimedia.org/r/298724 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[14:47:23] <grrrit-wm>	 (CR) Nuria: [C: 2 V: 2] Use webrequest in wikidata/articleplaceholder_metrics [analytics/refinery] - https://gerrit.wikimedia.org/r/298726 (https://phabricator.wikimedia.org/T138500) (owner: Addshore)
[14:50:09] <wikibugs>	 Analytics, Analytics-Cluster, Analytics-Kanban, Deployment-Systems, and 3 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2480203 (elukey) >>! In T129151#2476622, @thcipriani wrote: >>>! In T129151#2474657, @elukey wrote: >> Summary: >>  >> 1) External sca...
[14:53:38] <nuria_>	 joal: were you guys able to solve the timeout issue with loading (I saw the MAGIC cassandra parameter!)
[14:53:55] <joal>	 nuria_: two different things here
[14:54:05] <nuria_>	 joal: right, right
[14:54:13] <nuria_>	 joal: those two are unrelated i know
[14:54:23] <joal>	 nuria_: We have an operational issue with current aqs - yesterday hasn't loaded, and cluster seems busy
[14:54:37] <joal>	 nuria_: And we want to test new magic stuff :)
[14:55:07] <joal>	 nuria_: about new magic, code is ready (thanks elukey !), we'll wait for current compaction set to change the params
[14:55:27] <nuria_>	 joal: all right, that should change things somewhat
[14:55:27] <joal>	 nuria_: about operational issue, elukey is investigating AFAIK
[14:55:36] <joal>	 nuria_: hopefully it'll help
[15:05:05] <elukey>	 didn't have time to follow up on the issue but it seems that we have timeouts on aqs100[123], so it might be related to the loading problem
[15:08:33] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process - https://phabricator.wikimedia.org/T140870#2480338 (Nuria)
[15:10:31] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process - https://phabricator.wikimedia.org/T140870#2479098 (Nuria) Seems that one action item here is to document deployment to EventBus (cc @Ottomata )  Is services working on the restart problem? (cc @GWicke )
[15:10:55] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2480358 (Nuria)
[15:13:12] <elukey>	 joal: tomorrow I'll be afk and Jaime is going to do maintenance on dbstore1002 from 14:00 to 15:00 UTC.
[15:13:32] <elukey>	 just a reminder in case anybody will complain in here
[15:17:52] <joal>	 noticed elukey
[15:23:29] <urandom>	 elukey: fyi: re: https://phabricator.wikimedia.org/rOPUPd2f8d49f7cacd68f5ea17fb2457f580367e0509d
[15:23:41] <urandom>	 elukey: you can set these things ephemerally to try out settings
[15:24:05] <urandom>	 throughput is something you can set with nodetool (nodetool setcompactionthroughput <value>)
[15:24:33] <urandom>	 compactors requires a bit more finagling, but i can show you if interested
[15:25:00] <urandom>	 concurrent reads/writes should work the same way as compactors, though I'd have to look to be certain
[15:25:18] <elukey>	 urandom: oh yes but I wanted to do things cleanly to avoid madness :D
[15:25:28] <urandom>	 elukey: ok
[15:25:34] <elukey>	 we are waiting for the current compaction to finish and then I'll restart the instances
[15:25:54] <elukey>	 I wanted to wait to do the restarts because I thought it was cleaner
[15:26:13] <elukey>	 but if you think that we can restart safely instances while compacting I can try now
[15:26:29] <urandom>	 elukey: you don't need to restart, if you don't want
[15:26:30] <joal>	 elukey: actually, would be awesome to try now since you're away tomorrow :)
[15:26:44] <urandom>	 you can change throughput on the fly
[15:26:54] <joal>	 elukey: Maybe we can change it without restarting :)
[15:27:08] <joal>	 elukey, urandom : one way or another I don't really mind :)
[15:27:17] <urandom>	 yeah, you can ... which is why i mentioned it, it might make it easier to iterate and find the sweet spot
[15:27:34] <urandom>	 entirely up to you guys though
[15:27:36] <joal>	 Sounds correct urandom :)
[15:27:43] <joal>	 elukey: your thoughts?
[15:27:53] <elukey>	 joal: I know that you want me to do it now :P
[15:27:59] <joal>	 huhuhu :)
[15:30:33] <elukey>	 urandom: I am going to use setcompactionthroughput via nodetool
[15:30:51] <urandom>	 raising throughput might not get you more throughput without also raising the number of compactors, but you can try it
[15:31:45] <wikibugs>	 Analytics, Analytics-EventLogging, Multimedia, UploadWizard: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2480452 (Reedy)
[15:32:30] <elukey>	 elukey@aqs1004:~$ nodetool-b getcompactionthroughput
[15:32:31] <elukey>	 Current compaction throughput: 256 MB/s
[15:32:57] <elukey>	 !log raising compaction throughput to 256 on aqs100[456]
[15:33:00] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
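When iterating on the setting across several instances, it can help to read the value back programmatically. The sketch below parses the single output line shown above; the function name `parse_compaction_throughput` is hypothetical, and the sample string is copied from the chat rather than captured from a live node.

```python
import re


def parse_compaction_throughput(output):
    """Extract the MB/s value from a 'getcompactionthroughput' output line.

    Hypothetical helper: it assumes the output format seen in the chat,
    e.g. 'Current compaction throughput: 256 MB/s'.
    """
    match = re.search(r"Current compaction throughput:\s*(\d+)\s*MB/s", output)
    if match is None:
        raise ValueError("unrecognised nodetool output: %r" % output)
    return int(match.group(1))


# Sample line copied from the chat log, not from a live cluster.
sample = "Current compaction throughput: 256 MB/s"
print(parse_compaction_throughput(sample))
```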
[15:35:36] <urandom>	 elukey: OK, so a restart will put that back where it is configured
[15:36:02] <urandom>	 elukey: and, you might not see the throughput you're looking for without also adding concurrency (compactors)
[15:36:24] <urandom>	 elukey: let me put together the commands for that
[15:36:28] <elukey>	 urandom: yeah, from our calculations (if they make sense) we wanted to raise the compactors to 12 (from 10)
[15:36:32] <elukey>	 sure :)
[15:36:47] <elukey>	 would it be worth raising concurrent write/reads too?
[15:36:51] <elukey>	 not sure if possible
[15:37:21] <urandom>	 elukey: not sure
[15:38:26] <urandom>	 elukey: https://phabricator.wikimedia.org/P3520
[15:38:45] <urandom>	 you of course have to set the JMX port to match that of the instance you are trying to set
[15:39:21] <urandom>	 sjk comes from jvm-tool (that package I asked you to add)
[15:40:02] <urandom>	 raising concurrent writes/reads is a reasonable thing to experiment with, let me see if I can figure out how to do that ephemerally as well
[15:40:24] <wikibugs>	 Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, UploadWizard: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2480482 (Reedy)
[15:43:57] <urandom>	 elukey: i might need to code-dive on this one
[15:45:19] <elukey>	 urandom: don't worry I think we have enough for the moment :)
[15:45:55] <elukey>	 !log executed https://phabricator.wikimedia.org/P3520 on aqs100[456] for both a/b cassandra instances
[15:45:58] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[15:46:10] <elukey>	 joal: raised compactors to 12 and throughput to 256
[15:46:18] <joal>	 Yay elukey
[15:46:27] <joal>	 elukey: on every machine?
[15:50:51] <elukey>	 yes
[15:51:02] <joal>	 great :)
[15:54:45] <joal>	 elukey: We have confirmation about space-reduction by compression
[15:55:09] <joal>	 elukey: Still quite a few compactions to run, and already less than 200GB per machine
[15:55:43] * elukey dances
[15:55:54] * elukey hugs urandom
[15:56:49] <elukey>	 So if everything goes as planned we might get down to ~3.6TB for three years of data + space for compactions etc..
[15:57:00] <elukey>	 but I don't think it is enough to think about raid10
[15:57:08] <elukey>	 because we will need space anyway
[15:57:15] <elukey>	 but one thing at a time :)
[15:57:41] <joal>	 elukey: my bad for miswording: less than 200GB per INSTANCE
[15:57:56] <joal>	 We are getting closer to expected number, but not a big win I think
[15:59:17] <elukey>	 let's see the end result
[15:59:23] <joal>	 elukey: agreed
[15:59:25] <elukey>	 but it looks way better than before the compression change
[16:00:46] <joal>	 elukey: agreed
[16:01:48] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2480586 (mobrovac) >>! In T140870#2480338, @Nuria wrote: > Seems that one action item here is to document deployment to EventBus (cc @Otto...
[16:03:56] <wikibugs>	 Analytics-Kanban: Productionize Druid loader - https://phabricator.wikimedia.org/T138264#2480615 (JAllemandou) CR: https://gerrit.wikimedia.org/r/#/c/298131/
[16:20:11] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2480740 (Nuria) @mobrovac: do you have now ssh access to eventbus machines? (your team should)
[16:33:29] <milimetric>	 raid 5
[16:33:31] * milimetric runs away
[16:35:36] * elukey didn't see the message
[16:36:31] <nuria_>	 mobrovac: yt?
[17:00:39] <mobrovac>	 nuria_: am now
[17:01:33] <nuria_>	 mobrovac: regarding eventbus, are you guys doing any development on it in the near term?
[17:02:00] <mobrovac>	 nuria_: you mean the python service?
[17:02:00] <mobrovac>	 no
[17:08:09] <wikibugs>	 Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2481042 (mobrovac) >>! In T140870#2480740, @Nuria wrote: > @mobrovac: do you have now ssh access to eventbus machines? (your team should)...
[17:17:57] <nuria_>	 mobrovac: and what about ssh access to service machines for event bus?
[17:18:08] <nuria_>	 mobrovac: you should have that
[17:19:19] <mobrovac>	 nuria_: i must admit i don't recall the history exactly, but it was said/decided at some point that we don't need access as we won't be maintaining the service and that we can access kafka from other nodes
[17:19:57] <nuria_>	 mobrovac: really? cause it seems that to troubleshoot issues you might want ssh. In our team we have two devops that have permissions for everything
[17:20:12] <nuria_>	 mobrovac: but all of us have ssh to be able to see what is going on everywhere
[17:20:22] <nuria_>	 mobrovac: even if only otto and luca can restart the cluster
[17:20:40] <nuria_>	 mobrovac: is eventbus tier-2?
[17:21:24] <mobrovac>	 that's because 99% of your machines are in a different vlan than prod
[17:21:28] <mobrovac>	 (the access)
[17:21:40] <mobrovac>	 so i guess you got access to kafka100x by default
[17:21:51] <mobrovac>	 and/or they are regarded as analytics machines
[17:22:50] <nuria_>	 mobrovac: no no EL machines are in prod and so are kafka and so are the dbs
[17:23:43] <mobrovac>	 dunno what to tell you ...
[17:23:45] <nuria_>	 mobrovac: we have groups in puppet for specific services to manage that
[17:23:52] <mobrovac>	 our team is treated differently apparently
[17:24:01] <nuria_>	 but anyways, is Eventbus a tier-2 service?
[17:24:17] <nuria_>	 can it be down for say a day?
[17:24:44] <nuria_>	 ^ mobrovac
[17:25:02] <mobrovac>	 no, no way
[17:25:08] <mobrovac>	 it can't be down for even a minute
[17:25:11] <mobrovac>	 as we lose events then
[17:25:52] <nuria_>	 mobrovac: that is tier-1 and so you know our team cannot support tier-1 by itself, we are not staffed for that. i will talk to otto but maintenance has to be done together with ops
[17:26:20] <nuria_>	 tier-1 requires a 24/7 rotation which we do not have and will not have, as analytics services are tier-2
[17:27:13] <mobrovac>	 yes, it makes sense that maintenance on it is done in sync with ops
[17:31:01] <nuria_>	 mobrovac: what are the issues when eventbus is down?
[17:31:57] <mobrovac>	 nuria_: all of the eventbus system (the python service + kafka + change-propagation and other consumers) relies on mediawiki being able to get its events into it
[17:32:04] <mobrovac>	 nuria_: if the python service is down, mediawiki can't do that
[17:32:14] <nuria_>	 mobrovac: right, and what happens
[17:32:14] <mobrovac>	 and we lose events
[17:32:20] <nuria_>	 and ..?
[17:32:23] <mobrovac>	 and so, don't have reliable event delivery
[17:32:28] <nuria_>	 and ?
[17:32:35] <nuria_>	 what part of our system is affected?
[17:32:44] <mobrovac>	 and we miss page edits and other relevant info
[17:32:59] <mobrovac>	 the receiving end
[17:33:05] <mobrovac>	 restbase doesn't get updated
[17:33:09] <mobrovac>	 nor do mobileapps
[17:33:11] <mobrovac>	 and other services
[17:33:13] <nuria_>	 mobrovac: what party misses edits? (not the db)
[17:33:13] <mobrovac>	 etc etc etc
[17:33:25] <mobrovac>	 restbase for now
[17:33:35] <mobrovac>	 but that will extend to RCStream soon
[17:33:38] <mobrovac>	 and other consumers
[17:34:10] <nuria_>	 mobrovac: and restbase provides edit data for who?
[17:34:20] <nuria_>	 mobrovac: mobile apps only?
[17:34:47] <mobrovac>	 nuria_: restbase gets updated on each page edit, if the event is missed, restbase starts delivering old content to consumers which violates the contract
[17:34:57] <mobrovac>	 same goes for mobileapps
[17:35:29] <nuria_>	 mobrovac: right, i understand i am trying to see who are the consumers, not the desktop and mobile website correct?
[17:35:36] <nuria_>	 mobrovac: so mobile apps and who else?
[17:35:52] <mobrovac>	 nuria_: VE, android native app users
[17:35:59] <mobrovac>	 so yes, desktop users will be affected
[17:36:20] <nuria_>	 mobrovac: right, mobile apps and editors that use VE, anyone else?
[17:36:23] <mobrovac>	 editors that use VE, not readers though
[17:36:41] <mobrovac>	 other clients that rely on the REST API, such as google
[17:37:39] <nuria_>	 ok, so that makes android app, desktop editors using VE and clients to rest api, anyone else?
[17:38:11] <nuria_>	 mobrovac: only some parts of android app though
[17:38:12] <mobrovac>	 not that i know of, at this point
[17:38:30] <mobrovac>	 no, they switched to rb-based display
[17:38:56] <nuria_>	 mobrovac: ok, will consult with ops if these are tier-1
[17:39:36] <nuria_>	 mobrovac: not sure of our classification
[17:39:46] <mobrovac>	 k
[17:40:22] <mobrovac>	 nuria_: note that this classification is only temporary in the case of eventbus, as it gets used more and more its importance will grow
[17:40:48] <nuria_>	 mobrovac: what is the contingency plan if service goes down due to machine failure?
[17:40:57] <nuria_>	 mobrovac: can you guys bootstrap events from db?
[17:41:41] <mobrovac>	 we have two machines there, so one ought to stay up
[17:41:59] <mobrovac>	 nuria_: but we do have ways to get the content in
[17:43:19] <nuria_>	 mobrovac: will also be good to document those in administration page
[17:43:52] <nuria_>	 mobrovac: if what happened yesterday had happened on a Friday we probably would have needed to get the content somehow
[17:44:55] <mobrovac>	 yes
[17:45:05] <mobrovac>	 luckily, it happened only for mediawikiwiki and test2wiki
[17:45:18] <mobrovac>	 so not a big loss
[17:45:52] <nuria_>	 mobrovac: sure, please document procedure anyways, ok?
[17:46:16] <mobrovac>	 k
[19:40:10] <joal>	 !log Relaunch 2016-07-19 cassandra per-article-daily oozie job
[19:40:13] <analytics-logbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[20:56:44] <grrrit-wm>	 (PS6) Mforns: [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861)
[20:58:02] <mforns_>	 milimetric, that's it ^ :]
[20:58:21] <mforns_>	 good night a-team, and have a nice weekend, see you on wednesday next week!
[20:58:47] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] [WIP] Process MediaWiki User history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/297268 (https://phabricator.wikimedia.org/T138861) (owner: Mforns)
[22:01:15] <wikibugs>	 Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, and 2 others: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2480452 (matmarex) Hold on… is efSchemaValidate() a generic function to validate some JSON data against some JSON schema? Why is it i...
[22:04:07] <wikibugs>	 Analytics, Analytics-EventLogging, FileAnnotations, Multimedia, and 2 others: Move efSchemaValidate out of global scope - https://phabricator.wikimedia.org/T140908#2482213 (Reedy) >>! In T140908#2482208, @matmarex wrote: > Hold on… is efSchemaValidate() a generic function to validate some JSON da...