[02:41:35] Nemo_bis: the discovery team takes care of that dashboard, so I'm not sure what direction to point you in [05:39:44] It's ok, just linking for curiosity [06:03:44] goood morning o/ [06:03:59] joal: I can see that aqs1004-b has finally finished compacting [06:04:11] so I am going to re-image aqs1005 as soon as I get into the office [06:20:15] elukey: o/ [06:20:39] Let's discuss before moving forward, still something weird in there (data size on 1004-b [06:33:48] jurbanik: As nuria pointed out, change has been announced some time back (https://www.mail-archive.com/analytics@lists.wikimedia.org/msg03594.html - We even waited more than planned) [06:34:00] jurbanik: And old data will not be removed [06:37:44] elukey: found the issue on aqs1004-b : Problem of code manually updated [06:37:59] elukey: CPU usage is not cassandra but restbase failing to start [06:45:16] Analytics-Kanban: Stop generating pagecounts-raw and pagecounts-all-sites - https://phabricator.wikimedia.org/T130656#2531978 (JAllemandou) Everything looks good, email sent. [06:48:00] joal: you are right, it was stopped because of a scap deployment issue but then we the releng team fixed it restbase tried to start itself [06:48:29] elukey: hm, didn't get it [06:48:38] yeah sorry [06:49:31] puppet was failing due to a scap deployment issue, therefore aqs wasn't turned up. But when Tyler fixed it for the refinery he triggered a deployment of the scap package everywhere [06:49:44] that probably fixed the puppet issue on aqs1004 triggering restbase restarts [06:50:22] elukey: right [06:50:38] elukey: How do we hande that ? [06:50:49] elukey: stop restbase, or change codebase and restart restbase? [06:54:22] joal: I still don't know how to fix the aqs100[456] restbase issue, I remember that we had a chat about it.. If there is a workaround let me know, we can even disable puppet for a while [06:55:11] elukey: first, one confirmation: restbase is down on aqs1004-a, correct? [06:55:50] elukey: issue comes from code diff from the version we deploy using scap and the version needed by restbase on new cluster [06:58:39] yes it is disabled, but I needed to disable puppet too otherwise it would try to bring aqs up again [07:00:16] elukey: we can update code manually and have restbase starte [07:03:23] super [07:04:16] elukey: we acn take eample from aqs1005 ;) [07:05:11] sorry to ask again but is there a way to fix this issue in a way that deploying to aqs1004 fixes the issue without manual code changes? [07:05:32] it looks weird to me that we have to live hack hosts to make everything working [07:05:47] elukey: two clusters, two code bases, one repo ... I guess your answer is no :) [07:05:50] I know that we have differences with aqs100[123] but there might be a way to have both settings? [07:06:04] all right, answer before the question [07:06:08] I wanted too much magic [07:06:10] :) [07:06:10] :) [07:06:37] elukey: feasible - duplicate code for article data, then rename after (but rename is a pain in cassandra) [07:06:56] so, easier to actually manually change for loading time I think [07:07:25] can you do it now on aqs1004 or do you need extra permissions? 
[07:07:31] I don't recall what we did last time [07:07:53] elukey: let's check aqs1005 [07:08:47] elukey: in /srv/deployment/analytics/aqs/deploy/src, git diff HEAD [07:09:13] ahh now I recall [07:09:19] :) [07:09:21] you made me remove that bit of code [07:09:37] all right taking some notes so I'll be able to do it for aqs100[56] [07:10:08] elukey: only needed for aqs1004 for now, but for after restart, needed :) [07:10:56] yes yes [07:13:01] 09:12 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [07:14:55] great :) [07:15:17] super weird, aqs1004-b is 350GiB rather than 280ish as the other instances [07:15:23] elukey: indeed [07:15:29] elukey: couldn't it be log related? [07:16:00] joal: log related to aqs or cassandra? [07:17:12] elukey: don't know ... I'd say restbase, but normally everything is sent to logstash ... [07:17:42] yeah [07:17:44] root@aqs1004:/srv/cassandra-b/data# du -hs [07:17:44] 360G . [07:18:37] more precisely [07:18:39] 360G ./local_group_default_T_pageviews_per_article_flat [07:18:39] 360G ./local_group_default_T_pageviews_per_article_flat/data-95eadc90484111e6875bd593012fb45b [07:21:44] elukey: :( [07:23:18] joal: SSTable Compression Ratio: 0.24123920771965465 [07:23:31] this is nodetool-b tablestats [07:24:04] sorry maybe that was the wrong table [07:24:08] checking again [07:24:16] but could it be the compression settings? [07:25:44] I am checking diffs between local_group_default_T_pageviews_per_article_flat on a and b [07:26:20] elukey: number of keys differs a lot [07:26:36] meaning -b has a lot more data than -a [07:26:41] This is a bit weird to me [07:27:38] I was about to say the same [07:27:50] Number of keys (estimate): 346103704 and Number of keys (estimate): 679768001 [07:28:43] I am wondering if bootstrapping two instances at the time could bring these weird side effects [07:28:54] but theoretically the answer should be no [07:29:41] elukey: Don't know what to think :( [07:33:49] elukey: let's move forward and maybe ask urandom, or try reinstanciate aqs1004-b? [07:36:16] joal: I think that the procedure that we are following may have messed up the token distribution.. I'd like to spend a bit of time in understanding why aqs1004-b has more keys, even if nodetool status does not show anything unbalanced.. If we don't find any good solution it might be safer and quicker to just nuke all the nodes and restart from a clean start [07:36:47] elukey: ok [07:36:52] I am afraid that if we don't know exactly what we are doing we might get huge issues in the future when we switch [07:38:49] it might be that aqs1004-b has a lot of "unused" waiting to be cleaned? [07:39:02] (I am reasoning out loud, might be all garbage) [07:39:34] elukey: hwat doesn unused mean? [07:43:42] I am wondering if bootstrapping two instances might have confused how the tokens were assigned, ending up (for some reason) to a temporary assignment of more tokens to aqs1004-b. 
Eventually the ring balanced and aqs1004-b didn't delete tokens that it is not using [07:43:58] a bit of a stretch but I am not sure how cassandra behaves in these situations [07:44:12] elukey: one way to know that is call a repair on aqs1005-b [07:44:20] elukey: aqs1004-b sorry [07:47:09] yeah I was trying to read what is the best way to solve these situations [07:47:19] ahhh cassandra, sometimes I hate you a bit [07:48:31] I started nodetool-b cleanup [07:49:18] k elukey, we'll monitor :) [07:50:08] yesssssss [07:50:12] it is deleting a lot [07:50:33] or at least the trend seems good [07:50:36] :) [07:50:36] fingers crossed [08:20:40] -30GB for the moment, it looks very promising.. if this turns up to be the solution I'd try to bootstrap only one instance at the time [08:22:57] elukey: k [08:44:06] joal: I am going to re-image aqs1005, I think that we are good to go now [08:44:30] elukey: clening done? [08:44:46] almost, but the node is up when the cleanup happens [08:44:53] elukey: okey [08:45:11] I'll wait another half an hour just in case [08:45:15] then I'll re-image.. [08:45:22] elukey: as you wish :) [08:45:45] it looks good now, aqs1004-b is not 294GiB [08:45:48] much better :) [08:45:54] *now [08:46:01] elukey: possible reason for the thing is aqs1005-b receiving both data for a and b, but the weird thing is that it looks that a didn't receive data for b :( [08:47:32] yeah it seems that the ring-token-mess happened only for the b instance, not sure why.. it explains also why it took so much time [08:47:43] yes [08:47:45] weird [08:49:46] joal: totally unrelated but it might interest you - https://github.com/wikimedia/mediawiki/blob/d5c40c584a581f1dc6a184617d6bfd1629466bc9/includes/objectcache/ObjectCache.php [08:49:48] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2532122 (jcrespo) @Jdforrester-WMF It seems nobody complained, ok to irreversibly delete these tables? [08:50:18] it should be the various abstractions around mediawiki caches [08:50:30] (APC, mysql and memcached) [08:50:41] elukey: I can't read php, my eyes bleed too much [08:50:52] :D [08:52:13] I have the same problem! But it is fashinating [08:52:23] the abstraction looks really nice [08:52:28] elukey: why is it? [08:53:29] I am not sure how well this class hides low level details about where an object that you want to store goes [08:54:22] but from a quick view it seems that mw developers calls this class and specify the "kind" of cache that they need (local/fast on host, persisted on the same DC, replicated, etc..) 
[08:54:38] elukey: nice :) [08:54:56] and the object then ends up in mysql, memcached or APC [08:55:02] transparently afaiu [08:55:04] very nice [09:47:10] aqs1004-b's deletions seems not stopping :D [09:47:16] now 235GiB [09:51:31] it seems going into each db file and clean up unused keys [09:51:34] one by one [09:51:49] (files not keys) [10:16:14] ok I am bootstrapping cassandra-a on aqs1005 [10:16:42] but cassandra-b's sstables on aqs1004 are still deleting keys [10:16:43] sigh [10:31:05] joal: I am going to leave things going for today, but I suspect that the best path forward (and the quickest) is to wipe everything [10:31:11] and restart loading from scratch [10:31:17] we are loosing too much time with this [10:31:23] even if we are learning a lot of things [10:31:54] elukey: I'd rather not change strategy every week ) [10:33:14] joal: yes but the agreement was that if the time to re-load data was less than realoading we'd have continued [10:33:21] this is not happening :) [10:33:51] plus the cleanup looks weird, the sstable dropped below 200GiB [10:34:40] elukey: your call, reloading is not an issue :) [10:37:33] joal: I don't mind to take the call but it was more an open question rather than a statement [10:37:39] :) [10:37:47] I am following what we discussed last week, that's it [10:38:22] Now that we have invested some time into trying to bootstrap, I think the learning we are having is interesting [10:38:46] elukey: Imagine adding a few nodes while cluster in production and having those kind of issues [10:40:40] so Eric already discouraged what I am doing to bootstrap instances with live traffic, because he said that it is safer to decom an instance (so data shuffling + cluster rebalance) to then re-add it in the ring. Adding a new instance only is easier and it shouldn't cause this amount of pain :) [10:41:23] what I'd like to do it is balance results with knowledge [10:41:37] otherwise the new AQS cluster will go nowhere :( [10:42:39] nodetool-b cleanup finished on aqs1004 [10:42:43] good news [10:42:56] 173G ahhahaha [10:43:01] joal: --^ [10:43:12] elukey: interesting info is maybe doing regular cleaning can gain us some space :) [10:43:46] I cleaned all the instances and this one is the only one that dropped so heavily :/ [10:44:41] elukey: you launched clean on every instance this morning? 
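For reference on the cleanup discussed above: `nodetool cleanup` rewrites SSTables and drops data belonging to token ranges the instance no longer owns, which would explain why aqs1004-b shrank so much once the ranges it picked up during the double bootstrap were cleaned out. A minimal sketch of running cleanup per instance and measuring the space reclaimed — the `nodetool-a`/`nodetool-b` wrappers and `/srv/cassandra-*` data paths are the ones mentioned in the conversation; everything else is illustrative, not the team's actual tooling:

```python
#!/usr/bin/env python
# Illustrative sketch: run `nodetool cleanup` on each local Cassandra instance,
# one at a time, and report how much disk space each run freed.
import subprocess

INSTANCES = {
    'a': {'nodetool': 'nodetool-a', 'data_dir': '/srv/cassandra-a/data'},
    'b': {'nodetool': 'nodetool-b', 'data_dir': '/srv/cassandra-b/data'},
}

def data_size_kib(path):
    """Disk usage of a data directory in KiB, via `du -sk`."""
    out = subprocess.check_output(['du', '-sk', path])
    return int(out.split()[0])

for name, inst in INSTANCES.items():
    before = data_size_kib(inst['data_dir'])
    # cleanup drops SSTable data for token ranges this instance no longer owns;
    # it is safe but IO-heavy, so run instances sequentially, never in parallel.
    subprocess.check_call([inst['nodetool'], 'cleanup'])
    after = data_size_kib(inst['data_dir'])
    print('instance %s: %.1f GiB -> %.1f GiB (freed %.1f GiB)' % (
        name, before / 1048576.0, after / 1048576.0, (before - after) / 1048576.0))
```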
[10:45:06] one at a time, before starting aqs1004-b [10:45:14] but they completed very quickly [10:45:28] elukey: a repair might be good for aqs1004-b I think [10:45:53] I think my brain will need a repair if I keep going in this way :P [10:46:37] elukey: and now that the bootstrap of 5-a has started, I think it's wiser to actually stop and reload --> we might end up with data corruption, since a node is down (5-a), and one is possibly corrupted [10:47:07] elukey: going with the process we used, we should have: [10:47:24] - bootstrapped instances one by one [10:47:36] - if issues, solved the issue before bootstrapping again [10:48:08] multiple actions involving data reconstruction at the same time seem not to work for cassandra [10:48:20] yes it makes sense [10:49:56] so I have stopped aqs1005-a and now I am going to run nodetool-a repair [10:50:16] sorry nodetool-b repair on aqs1004 [10:50:24] reading from the docs it should be enough [10:50:36] elukey: wrong idea - if 4-b and 5-a share some data, there are chances we'll corrupt data [10:50:57] cause there is only one source, so no possibility of quorum [10:52:17] 5-a is down and doesn't have any data, how can it possibly corrupt anything? [10:52:29] 4-b will repair using the other replica as well [10:52:48] or maybe I am not following [10:53:26] elukey: replication factor is 3, so if 4-b and 5-a shared some data, now that 5-a is down, only one other instance has the same data - how do we know which one (of 4-b and the other) has the correct copy? [10:57:28] 4-b is probably missing some data, the other node should be able to provide the missing pieces [10:57:33] elukey: in other words, there is a possibility of not being able to reach quorum (given that 5-a is down) [10:57:34] this is my assumption [10:58:22] what is your suggestion? [10:58:35] reload [10:58:50] ah ok, so completely wipe [10:59:09] having 2 instances possibly sharing corrupted data doesn't allow us to repair [10:59:37] could lead to inconsistencies yes [10:59:57] elukey: in order to repair, we should have put 5-a down by downgrading [11:00:19] elukey: this way, cassandra would have re-sent data in order to reach quorum 3 for the bits of data 5-a owned [11:00:52] but currently, there are only 2 instances holding the data that has been wiped from 5-a [11:01:13] mmm probably I would have needed to wait for the cleanup, and leave 5-a up. I thought that cleanup was completing and that it would have been a safe op, but I was wrong [11:01:31] if possibly-corrupted 4-b is one of them, then cassandra can't decide which instance owns the truth [11:02:01] elukey: it's ok, learning :) [11:02:22] I'd need to check the repair techniques first because there might be a chance that Cassandra knows what to do, but I am a bit ignorant so I'll follow your advice [11:02:22] elukey: and now we have a good reason to fully wipe and reload ;) [11:02:27] yeah [11:02:28] :) [11:02:42] ok proceeding, you should be able to load again soon [11:02:52] elukey: maybe you can tell cassandra that a node is possibly corrupted [11:02:55] elukey: WAIT [11:03:08] elukey: let's read and see if we can figure out a way of repairing [11:04:13] I was reading the read-repair docs :) [11:04:53] precisely http://www.datastax.com/dev/blog/advanced-repair-techniques [11:08:53] but even if we have a way to hint, would we trust the data stored/repaired? [11:09:12] I actually don't know [11:10:26] thinking out loud: the cleanup has reviewed all the db files and deleted keys that it thought were not needed by the instance.
The other replica should be able to correct this when they compare their status [11:10:33] elukey: every post I read about getting back consistency uses cassandra backup/restore [11:11:43] elukey: but we don't have enough space to enable that [11:11:53] maybe we can try the repair and see how it goes, as a learning experiment. I am not sure that we'll be able to trust the current data [11:12:07] so with possibility of quorum being lost and corrupted dzata associated, I think reload is the only way [11:12:48] elukey: completely feasible [11:12:53] all right, can I proceed? [11:12:58] please :) [11:13:29] elukey: I think it'll be interesting to follow what repair says, and try to understand if it fails for instance due to not being able to reach quorum [11:14:03] elukey: I could take my break now, is that ok for you? [11:14:23] Did not get positive replies from all endpoints. List of failed endpoint(s): [11:14:54] so two instances down are a serious problem [11:15:01] elukey: for sure it is [11:15:18] wow I didn't think it would have been so bad [11:15:22] elukey: but it's different ahving 2 instances down and having 2 instances corrupted ! [11:15:41] sure sure, byzantine failures are always a sneaky issue :D [11:15:52] ok so joal I am going to re-image 1006 and wipe all [11:15:57] confirm? [11:16:15] elukey: 2 instances down - you can't write, and if at least one startup without corruption, things are ok [11:16:28] cause you can repair the corrupted one [11:16:41] yep yep [11:16:44] 2 instances corrupted - WRONGGGGGG [11:16:49] DDIIIING [11:16:50] Please wipe it all :) [11:16:53] game over [11:16:54] :P [11:16:57] DING DING SING [11:16:58] all right, have a good break [11:16:59] Indeed [11:16:59] ahahhahaha [11:17:20] elukey: By chance we can reload reasonably easily [11:17:35] Taking my break, will start loading when I get back I guess ;) [11:17:46] elukey: You should actually wipe it all, right ;) [11:18:03] 1004 as well [11:18:06] starting afresh [11:18:25] sure sure [11:22:17] going to eat something while 1006 is reimagining [11:58:57] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2532785 (Aklapper) p:Triage>Normal [12:33:51] joal: cluster up and running [12:34:05] I think we'd need to add fake data [12:34:15] (to pass the health checks) [12:34:22] but everything is up and running [12:35:13] I am going to add a summary of what has happened to the phab task [12:35:17] and I'll inform the team [12:36:29] (the raid10 arrays on aqs100[45] are re-syncing since I wiped all the data) [12:59:19] Analytics-Kanban: Replace RAID0 arrays with RAID10 on aqs100[456] - https://phabricator.wikimedia.org/T142075#2532948 (elukey) TL;DR: -------------- Today we decided to finish the hosts reimage and wipe the whole cluster to avoid possible data corruption issues. Long explanation: -------------------- Sta... [12:59:35] hope that the explanation is good enough [12:59:37] :) [13:14:59] a-team going afk, just sent an email o/ [13:16:02] Hey joal & milimetric [13:16:22] I'd like to push the live systems meeting back 1 week and then continue on our biweekly cadence. OK? [13:16:24] hey halfak [13:16:34] ok with me [13:22:58] Great. Thanks milimetric. joal? [13:55:20] milimetric: have you seen the wikitech-l convo "Loosing the history of our projects to bitrot"? [13:55:32] hey halfak, ok with me as well :) [13:55:42] Cool. moving! 
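To make the morning's replication-factor argument concrete, here is the arithmetic behind "2 instances corrupted - WRONGGGGGG" as a tiny illustrative snippet (just numbers, not Cassandra API): with RF=3 a quorum needs 2 replicas, and with aqs1005-a down and aqs1004-b untrusted only one trusted replica can remain for the ranges those two shared, so no majority exists and a full reload is the safe option.

```python
# Illustrative quorum arithmetic for the situation described above.
replication_factor = 3
quorum = replication_factor // 2 + 1        # 2 replicas must agree

down_instances = 1       # aqs1005-a: stopped and wiped mid-bootstrap
untrusted_instances = 1  # aqs1004-b: data possibly inconsistent after the cleanup
trusted_remaining = replication_factor - down_instances - untrusted_instances

print('quorum needed        :', quorum)             # 2
print('trusted replicas left:', trusted_remaining)  # 1
# 1 < 2: for token ranges replicated to both affected instances there is no
# majority of trusted copies, so repair cannot arbitrate -> wipe and reload.
```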
[13:56:02] ottomata: no, sounds very relevant, looking up [13:56:57] cool ja, i think its mostly about content, but was thinking you might want to chime in about future of edit history stuff [14:06:12] !log Updating cassandra compaction to deflate on newly wiped cluster [14:06:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:09:59] !log Adding test data onto newly wiped aqs cluster [14:10:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:10:11] hey yall, am planning on doing an analytics eventlogging deploy this morning [14:10:14] a-team ^ [14:10:19] lemme know if there are any objections [14:10:41] cool with me [14:11:07] ottomata: no problemo [14:13:21] !log Loading 2016-06 in clean new aqs [14:13:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:30:33] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/303340 (owner: Milimetric) [14:35:17] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2533133 (Jdforrester-WMF) >>! In T141407#2532122, @jcrespo wrote: > @Jdforrester-WMF It seems nobody complained, ok to irreversibly d... [14:44:49] hey milimetric [14:46:45] hey joal [14:46:47] what's up [14:47:07] milimetric: looking at scaling history rebuild using your sqoops [14:47:32] milimetric: I'll have some questions / ideas when you'll have time [14:48:01] omg joal I just accidentally deleted the directory I put them :( [14:48:09] there's no undoing rm -r is there? [14:48:14] omg, so stupid... [14:48:20] hm hm, I don't think there is [14:48:22] joal: batcave? [14:48:24] sure [15:12:21] Analytics-Kanban: User history: Fix the oldUserName and newUserName in blocks/groups log events - https://phabricator.wikimedia.org/T141773#2533402 (Nuria) Open>Resolved [15:13:16] !log deploying eventlogging/analytics - kafka-python 1.3.0 for both consumers and producers [15:13:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [15:13:27] (PS14) Joal: [WIP] Refactor Mediawiki History scala code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301837 (https://phabricator.wikimedia.org/T141548) [15:15:44] joal: ok, cave again? [15:16:04] yup milimetric, OMW ! [15:18:24] milimetric, joal, are you guys going to dicuss edit history? [15:18:46] come on in, mforns [15:18:50] ok [15:20:22] Analytics-Cluster, Analytics-Kanban, Deployment-Systems, scap, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2533451 (Nuria) Open>Resolved [15:21:07] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533453 (BBlack) So, before seeing this ticket I hadn't been looking at the URL/hostname patterns of these requests. Now that I am: In the US, we're seeing the... 
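The "!log Updating cassandra compaction to deflate" entry above presumably refers to table compression (joal later says "I had to change the compression"). A hedged sketch of what such a change could look like with the DataStax Python driver — the keyspace name is inferred from the data directory seen earlier in the day, and the table name "data" plus the pre-3.0 'sstable_compression' option name are assumptions, not confirmed details:

```python
# Illustrative only: switch the per-article table to DeflateCompressor.
# Keyspace inferred from .../local_group_default_T_pageviews_per_article_flat;
# table name and option name are assumptions for a pre-3.0 Cassandra.
from cassandra.cluster import Cluster

session = Cluster(['localhost']).connect()
session.execute(
    'ALTER TABLE "local_group_default_T_pageviews_per_article_flat".data '
    "WITH compression = {'sstable_compression': 'DeflateCompressor'}"
)
# Only newly written SSTables use the new compressor; existing ones are
# rewritten on compaction or via `nodetool upgradesstables`.
```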
[15:48:57] Analytics-Dashiki: Mediawiki storage package shoudl request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533536 (Nuria) [15:49:15] Analytics, Analytics-Dashiki: Mediawiki storage package should request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533548 (Nuria) [15:56:55] milimetric: can you chimein on the comment here [15:56:56] https://gerrit.wikimedia.org/r/#/c/301284/14/jsonschema/mediawiki/page/delete/1.yaml [15:57:04] about required user_groups? [15:57:06] sorry [15:57:09] k [15:57:12] required performer.is_bot [15:57:22] oh [15:57:24] acutally [15:57:25] actually [15:57:34] naw, i think i need you more on the rev_count field comment [16:00:29] mforns: stadduppp [16:35:08] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533764 (Nuria) @Amire80 : what problem are these spikes causing you? The spikes represent real traffic, not per se "user initiated requests" [16:40:43] Analytics, Analytics-Dashiki: Mediawiki storage package should request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533536 (Milimetric) p:Triage>Normal [16:44:34] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533804 (Amire80) >>! In T141506#2533764, @Nuria wrote: > @Amire80 : what problem are these spikes causing you? The spikes represent real traffic, not per se "u... [16:46:21] Analytics, Analytics-Dashiki: Improve initial load performance for dashiki dashboards - https://phabricator.wikimedia.org/T142395#2533834 (Milimetric) [16:47:32] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533839 (BBlack) If we're looking to reduce impact on global statistics interpretation, simply filtering out all requests which have a User-Agent string containi... [16:47:57] Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524050 (Milimetric) p:Triage>Normal [16:49:06] Analytics, Pageviews-API: Pageview API Capacity Projections when it comes to storage - https://phabricator.wikimedia.org/T141789#2533851 (Milimetric) p:Triage>Normal [16:49:37] Analytics, Operations, Ops-Access-Requests: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522728 (Ottomata) This was talked about in ops meeting today. Ops would prefer that we create another group, `deploy-aqs` perhaps,... [16:50:56] Analytics, Operations, Ops-Access-Requests: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2533860 (Ottomata) Hm, there is already an `aqs-users` group. Should we reuse it for this? [16:50:59] Analytics, Monitoring, Operations: Switch jmxtrans from statsd to graphite line protocol - https://phabricator.wikimedia.org/T73322#772448 (Milimetric) @elukey we put this on Q2, let me know if it should be earlier. 
[16:52:15] Analytics-Kanban, Spike: Spike - Slowly Changing Dimensions on Druid - https://phabricator.wikimedia.org/T134792#2533865 (Milimetric) p:Triage>Normal [16:52:19] Analytics-Kanban: Browser dashboard blogpost - https://phabricator.wikimedia.org/T141267#2533866 (Milimetric) p:Triage>Normal [16:52:30] Analytics-Kanban: Continue New AQS Loading - https://phabricator.wikimedia.org/T140866#2533867 (Milimetric) p:Triage>Normal [16:52:35] Analytics-Kanban: Scale scala algorithms using graph partitioning - https://phabricator.wikimedia.org/T141548#2533868 (Milimetric) p:Triage>Normal [16:52:39] Analytics-Kanban: Productionize edit history extraction for all wikis using Sqoop - https://phabricator.wikimedia.org/T141476#2533869 (Milimetric) p:Triage>Normal [16:52:47] Analytics-Kanban: User history: Adapt the user history reconstruction to use scaling by clustering - https://phabricator.wikimedia.org/T141774#2533870 (Milimetric) p:Triage>Normal [16:52:54] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Change or upgrade eventlogging kafka client used for producing - https://phabricator.wikimedia.org/T141285#2533871 (Milimetric) p:Triage>Normal [16:52:59] Analytics-Kanban: EventBus Maintenace: Fork child processes before adding writers - https://phabricator.wikimedia.org/T141470#2533872 (Milimetric) p:Triage>Normal [16:53:04] Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2533873 (Milimetric) p:Triage>Normal [16:54:06] Analytics: Pageview API demo doesn't list be-tarask - https://phabricator.wikimedia.org/T119291#2533878 (Milimetric) Open>Resolved a:Milimetric looks like it works in the new tool: http://tools.wmflabs.org/pageviews/?project=be-tarask.wikipedia.org&platform=all-access&agent=user&range=latest-20&p... [16:55:31] Analytics, Security-Reviews: Security review of Analytics Query Service - https://phabricator.wikimedia.org/T114918#2533896 (Milimetric) Open>Resolved a:Milimetric @dpatrick if you think this is still needed, please reopen or ping us. The service has been up for a few months and we reviewed... [16:57:19] Analytics, Datasets-General-or-Unknown: UploadWizard dataset is empty, limn has no data - https://phabricator.wikimedia.org/T112851#2533911 (Milimetric) Open>Invalid no longer relevant, metrics available in new dashboard: https://edit-analysis.wmflabs.org/multimedia-health/ [16:58:19] joal: o/ [16:59:04] elukey: \o [16:59:35] Analytics: hourly pageview dumps can contain empty title - https://phabricator.wikimedia.org/T90629#1063642 (Milimetric) We no longer maintain these datasets, please take a look at the new pageviews dataset: https://dumps.wikimedia.org/other/analytics/ and specifically https://dumps.wikimedia.org/other/pagev... [16:59:48] Analytics: hourly pageview dumps can contain empty title - https://phabricator.wikimedia.org/T90629#2533942 (Milimetric) Open>declined [17:00:26] joal: all good with the new cluster? 
Not sure if you had the chance to double check [17:00:55] elukey: Data is loading, everything ok so far :) [17:01:21] \o/ [17:01:22] elukey: I had to change the compression, and added the test data [17:01:49] elukey: only weird thing is that some test files are not present (in order to run icinga test) [17:01:50] I didn't remember if you put them somewhere or if they were on a ticket [17:02:29] (the instructions about compression and test data) [17:02:31] elukey: They should be here by default [17:03:02] Analytics: Better publishing of Annotations about Data Issues - https://phabricator.wikimedia.org/T142408#2533952 (Milimetric) [17:03:04] ah, test data: /srv/deployment/analytics/aqs/deploy/scripts [17:04:02] mmm does scap pull the code when puppet installs/configure it for the first time? [17:04:20] probably yes otherwise I don't explain how I could have modified pageviews.js :D [17:04:37] it should on a clean install, yes [17:05:19] thanks bd808 :) [17:07:25] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533974 (Milimetric) @Amire80 so we could try to clean up this data in our pageview data pipeline, but it would be a *lot* of effort. Also, I'm not at all sure... [17:07:42] joal: tomorrow would you mind to refresh to me how to load data to cassandra? I know that the answer is oozie :) [17:07:52] no problem elukey :) [17:08:05] thanks! [17:08:20] all right all is proceeding well, I wanted to double check [17:08:37] Analytics, Operations, Traffic: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2533998 (Nuria) [17:08:40] joal: last step is to change the user security settings, we might do it after one or two loads [17:08:52] in the meantime, I'll prepare the puppet changes [17:09:20] sure [17:12:59] milimetric: Heya [17:18:19] milimetric: I moved enwiki data from your folder to mine (using partition structure) to test with mforns [17:18:34] milimetric: Let me know if it causes any issue [17:20:14] all right logging off (again), have a good day/evening people :) [17:24:30] elukey: have a good evening ! Tomorrow [18:20:12] Analytics, Analytics-Cluster: Create new analytics cluster / hadoop role for mediawiki vagrant - https://phabricator.wikimedia.org/T115707#2534354 (Ottomata) [18:23:32] nuria_: , want to do the kafka broker bounce soon [18:23:38] ottomata: sure [18:23:38] yt? want to watch logs with me? [18:24:00] ottomata: sure, i am at library but hopefully i can do video, let me get headset [18:24:14] ottomata: shoudl we tail them on EL box or eventbus box? [18:24:33] *should [18:24:33] ah let's just to IRC [18:24:34] its ok [18:24:39] EL box [18:24:42] ok [18:24:43] this is just for analytics eventlogging [18:26:10] ottomata: on /var/log/upstart? [18:26:13] up [18:26:14] yup [18:26:25] i'm looking at /var/log/upstart/eventlogging_processor-client-side-00.log and /var/log/upstart/eventlogging_consumer-mysql-m4-master-00.log [18:26:29] just to reduce noise [18:26:30] but any [18:26:35] eventlogging_*.log should be active [18:27:05] i'm also tailing the log files in /srv/log/eventlogging [18:27:11] and just watching the msgs/sec of those [18:27:14] with pv -l > /dev/null [18:28:55] ottomata: forgot about pv! [18:29:29] k nuria_ ready? 
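On the `pv -l > /dev/null` trick mentioned just above: piping a tailed log through pv with -l simply shows how many lines per second flow by, i.e. messages per second. A rough, purely illustrative Python equivalent — the default path is a placeholder, not a real filename:

```python
#!/usr/bin/env python
# Rough equivalent of `tail -f <log> | pv -l > /dev/null`: follow a log file
# and print how many new lines (messages) arrive per second.
import sys
import time

path = sys.argv[1] if len(sys.argv) > 1 else '/srv/log/eventlogging/example.log'

with open(path) as f:
    f.seek(0, 2)                 # start at end of file, like tail -f
    count, last = 0, time.time()
    while True:
        line = f.readline()
        if line:
            count += 1
        else:
            time.sleep(0.05)     # nothing new yet, back off briefly
        now = time.time()
        if now - last >= 1:
            print('%d msgs/sec' % count)
            count, last = 0, now
```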
[18:29:39] i'm going to stop kafka broker on 1013 [18:29:44] k [18:29:44] see what happens [18:29:46] then start it [18:29:48] see what ahpepns [18:29:52] then do a leader election [18:29:54] and see what happens :) [18:29:56] k doing [18:30:16] !log restarting kafka broker on kafka1013 to test eventlogging leader rebalances [18:30:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:31:04] interesting. [18:31:33] ottomata: me no see nothing on processor log [18:31:37] no? [18:31:38] i see lots [18:31:49] it looks like it settled [18:31:51] but it took a bit [18:32:58] hm, ok [18:33:00] starting 1013 [18:34:16] ok nuria_ doing an election [18:34:27] ottomata: to verify , is this: deployment-eventlogging03 [18:34:30] nono [18:34:33] this is in prod [18:34:35] eventlog1001 [18:35:24] ottomata: ahahaha [18:35:30] :) [18:35:37] ok, good news: [18:35:42] nothing died and everything kept working [18:35:43] WOOHOO [18:35:48] bad news: i think i see some dropped messages. [18:35:53] it ran out of retries. [18:36:06] metadata changes weren't propagated fast enough, i guess. [18:36:07] hm. [18:36:13] we could up the number of retries for analytics [18:36:20] not a bad idea since this is totally async and auto commit [18:36:26] not sure it would help [18:36:30] but it wouldn't hurt [18:36:49] nuria_: don't worry, you'll get a second chance to watch [18:36:51] i think we should do it again [18:36:58] i'll restart eventlogging with a manual increase of retries [18:37:03] in prod [18:37:05] ja? [18:37:29] ottomata:ok, now that i am looking ta teh right logs [18:37:32] *at the [18:37:40] hm [18:37:46] and/or i could increase retry_backoff_ms [18:37:49] its default is 100 [18:40:56] ottomata: what does retry_backoff do? [18:41:58] An artificial delay time to retry the [18:41:58] produce request upon receiving an error. This avoids exhausting [18:41:59] all retries in a short period of time. Default: 100 [18:42:21] ottomata: ah , cause now retries are just "attemps?" [18:42:31] or rather "number of attemps?" [18:42:34] yes [18:42:39] it will retry a produce request that many times [18:42:47] when the broker dies, in flight produce requests will return an error [18:42:56] (when the broker dies, OR when leadership changes, its the same) [18:43:06] that triggers the client to request the new partition layout and new leaders [18:43:17] then the produce requests will automatically be retried to the new leaders [18:43:37] but, if the metadata refresh takes longer than the client will keep retrying [18:43:40] the produce requests might fail [18:43:48] right [18:43:53] i'm not sure if the retry backoff is compounding [18:43:56] it look slike its not [18:44:08] it'll just space out retries for the same produce request 100 ms aparts [18:44:23] for analytics increasing that is probably good. increasing retries is also probably good [18:44:39] maybe backoff=200 and retries=6 [18:44:42] for analytics this is good. [18:44:42] ottomata: even increasing times i do not think you will be able to guarantee delivery of messages upon a restart though [18:44:50] i wouldn't do that for eventbus though [18:45:00] nuria_: it should. 
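A sketch of the producer settings discussed above, using kafka-python (the 1.3.0 client deployed earlier in the day) — this is not the actual eventlogging code, and the broker list and topic name are placeholders. The errback makes exhausted-retry failures visible, which addresses the concern raised just above about whether dropped produce requests get logged at all:

```python
# Illustrative producer configuration matching retries=6, retry_backoff_ms=200.
import logging
from kafka import KafkaProducer

log = logging.getLogger('producer')

producer = KafkaProducer(
    bootstrap_servers=['kafka1013:9092'],  # placeholder broker list
    retries=6,              # retry each produce request up to 6 times
    retry_backoff_ms=200,   # wait 200 ms between retries (library default is 100)
    acks=1,                 # async/best-effort, as for analytics eventlogging
)

def on_send_error(exc):
    # Called when a send has exhausted its retries, e.g. after repeated
    # NotLeaderForPartitionError during a leader election.
    log.error('dropped message: %s', exc)

future = producer.send('example-topic', b'{"event": "..."}')  # placeholder topic
future.add_errback(on_send_error)
producer.flush()
```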
[18:45:14] as soon as a broker goes down, or leadership changes (same same) [18:45:27] kafka uses zookeeper to elect a new leader for the relevant partitions [18:45:45] clients are supposed to retry the produce requests [18:45:54] with the new leadership layout after the election has finished [18:46:11] ottomata: ah, i see a new broker will respond [18:46:11] this is how kafka can ensure delivery of messages even if a broker dies [18:46:13] yes [18:46:21] ottomata: it is not the whole cluster, ok [18:46:23] ja [18:46:30] but the client has to do this properly [18:46:33] this is not handled by the brokers [18:46:45] the client just gets an error when it tries to produce [18:46:55] that is what triggers it to ask the kafka cluster for its new metadata layout [18:47:08] then it will reconfigure its produce requests internally [18:47:12] so that they are retried to the correct brokrers [18:47:26] say topicA has partitions 0,1,2 [18:47:33] brokers 00, 01 and 02 [18:47:44] each of whcih is the prefered leader for a partition [18:47:48] if we stop broker 00 [18:48:01] topicA partition 0's leadership will be given to one of the other brokers [18:48:06] say broker 01 [18:48:13] it will try 01 upon getting an error on producing [18:48:21] well, it will get an error [18:48:26] and will get an error from 01 until that is the leader [18:48:27] then ask kafka for the new leadership layout [18:48:33] right [18:48:41] then it will use that info to figure out where messages for partition 0 should go [18:48:48] but, if the new layout takes too long [18:48:55] say the election takes too long or something [18:49:08] where 'too long' is longer than the client is willing to retry [18:49:11] then the client will give up [18:49:20] ottomata: right [18:50:03] so, ja, for async analytics use, let's try increasing both [18:50:10] we are ok with trying for a while [18:50:20] 200 and 6 retries [18:50:38] shouldnt' be more than 1.2 seconds, but realisitcally less than 2 at least [18:50:39] that should be plenty of time for the election [18:54:08] !log restarting eventlogging with processors retries=6&retry_backoff_ms=200. if this works better, will puppetize. [18:54:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [19:00:10] ok nuria_ el restarted with those settings, i'm going to restart kafka1013 broker and do election again [19:00:29] ottomata: ok looking [19:03:27] ottomata: I see connection failed but no election yet [19:03:39] nuria_: i haven't done eleciton, you won't see election messages [19:03:41] you'll see things like [19:03:52] kafka.errors.NotLeaderForPartitionError [19:03:54] ottomata: i sahll see connection to anew broker right? [19:03:58] *shall [19:04:03] not sure if it will log that [19:04:07] ottomata: k [19:04:15] Node 13 connection failed -- refreshing metadata [19:04:17] is good [19:04:43] what i'm not sure about is: retrying (0 attempts left). [19:05:33] hmmmmm [19:05:33] actually [19:05:41] i'm reading code [19:05:54] not so sure it logs when the produce request ultimately fails because not enough retries left? 
[19:05:56] hm [19:06:02] it looks like it just updates some metrics [19:06:02] hm [19:07:22] hm [19:07:27] ok nuria_ not so sure that helped [19:07:32] but i'm going to do election now anyway [19:07:38] this should have fewer messages about disconneced node [19:07:41] sicne 1013 is back up [19:07:45] and this is just a simple leadership change [19:08:54] AHHH nuria_ [19:08:54] i did this all wrong [19:08:55] i only updated the configs for ONE of the processors [19:08:56] doh! [19:08:57] duh [19:09:06] that's why we got to 0 attempts left, we were still using the default settings for most processors [19:09:12] ottomata: ok [19:09:13] ok, going to do this again... [19:09:23] ottomata: right cause processors there are more than 1 [19:09:31] ottomata: i forget, consumers there is only 1 [19:11:05] yeah, that's right [19:11:09] we still want to make mroe mysql consumers [19:11:22] ottomata: right but we need the autoinsert [19:11:26] but we can't cause of that replication issue [19:11:26] aye [19:11:33] ok nuria_ stopping broker again. [19:11:34] ottomata: should we try again? [19:12:19] yes [19:12:22] we trying again now [19:12:49] yeah, rats. [19:13:03] lots of errors when broker goes down. [19:13:03] lots of 0 attempts left messages. [19:13:27] well, i mean, this is better than the process dying :/ [19:13:39] ok, let's get drastic and try again with a long backoff ms [19:13:41] just to see [19:13:45] i'll only change in on processor 00 [19:13:50] and only restart that one [19:14:52] k [19:15:22] logging that one [19:17:05] Hmm actually, nuria_ [19:17:08] going to do it for 01 [19:17:12] yessir [19:17:14] hmmm [19:17:16] hang on [19:17:20] i want to do it for one that has 13 as the leader :p [19:17:24] lemme figure out which one that is [19:17:47] actually, no we'll just restart broker 22 [19:17:47] client 00 is consuming partition 11, which has 22 as leader [19:18:38] PSHHH [19:18:39] still did it. [19:20:53] ottomata: it did not reconnect right? [19:21:34] nuria_: its not about reconnecting [19:21:56] its about having enough time to retry the message to the proper broker [19:21:59] i would think 10 retries and 1 second backoff would be *plenty* [19:22:05] but there are still a few '0 attempts left' message [19:22:18] now, that doesn't mean it would drop, but since this isn't logging the actual failed produce [19:22:27] and since it shouldnt' get to 0 anyway [19:22:32] with 10 retries and 1 second backoff [19:22:44] i am assuming that this is dropping a message [19:22:44] i only see it happening twice for this processor though [19:22:47] which is pretty good, but not perfect [19:22:51] it should be perfect [19:25:42] def better than with pykafka though [19:25:46] so at least its an improvement. [19:26:40] hm, ok welp. [19:27:00] i'm going to puppetize increase in retries and backoff, but not to 10 and 1 second [19:27:02] probably just to 6 and 200 ms [19:27:14] will make a task to investigate this, and then put it off for a while... [19:35:33] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2534689 (Nuria) >I'd like to know how many real users with real non-mobile browsers looked at the site to compare this with the number of people who clicked on i... [19:35:52] ottomata: k [19:36:28] milimetric: yt? [19:37:10] milimetric: the sunburst graphs upon loading by default they display data from the very beginning right? 
just like the timeline graph cc mforns [19:37:49] Analytics, Analytics-EventLogging: Ensure no dropped messages in analytics eventlogging processors when stopping broker - https://phabricator.wikimedia.org/T142430#2534697 (Ottomata) [19:42:44] nuria_, I think so [19:43:06] nuria_, yes sure [19:44:03] mforns: ok. Will change date limits and defaults in all then. [19:44:11] ok [19:44:13] makes sense [19:44:54] Analytics-Cluster: Install SparkR on the cluster - https://phabricator.wikimedia.org/T102485#2534727 (Ottomata) a:Ottomata>None [19:45:14] Analytics-Cluster, Operations: Queries in Hue always return an empty result set - https://phabricator.wikimedia.org/T128039#2534730 (Ottomata) a:Ottomata>None [19:45:48] Analytics, Analytics-Cluster: Monitor cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#2534733 (Ottomata) a:Ottomata>None [19:48:15] Analytics, Analytics-Cluster, Improving-access, Research-and-Data-Backlog: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#2534744 (Ottomata) Open>declined We don't keep IPs in eventlogging anymore. It would be good to send request-id with webre... [19:48:56] Analytics-Cluster: Hue shows error from varnish when issuing Hive query - https://phabricator.wikimedia.org/T95953#2534748 (Ottomata) Open>declined [19:49:08] Analytics-Cluster: Write wikitech spark tutorial - https://phabricator.wikimedia.org/T93111#2534749 (Ottomata) a:Ottomata>None [19:49:17] Analytics-Cluster: Create HivePartitioner in Camus - https://phabricator.wikimedia.org/T92494#2534750 (Ottomata) a:Ottomata>None [19:49:31] Analytics-Cluster, Operations: Clean up permissions for privatedata files on stat1002 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#2534751 (Ottomata) a:Ottomata>None [19:49:57] Analytics-EventLogging, Operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#2534753 (Ottomata) a:Ottomata>None [19:50:31] Analytics-EventLogging, Operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1132416 (Ottomata) Open>declined Setting up eventlogging in codfw doesn't seem very useful, since we won't be setting up the analytics cluster there. Declining this for now. [19:50:52] Analytics, Analytics-Cluster, Operations, Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2534758 (Ottomata) a:Ottomata>None [19:51:10] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2534760 (Danielsberger) I've run the 1h data set through R and there's a brief summary below, if anyone is interested. My main finding is that a) the data is consistent and will prove ver... [19:52:54] Analytics-Kanban, Patch-For-Review: Change or upgrade eventlogging kafka client used for consumption - https://phabricator.wikimedia.org/T133779#2534763 (Ottomata) p:Triage>Normal [19:53:03] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2534764 (Ottomata) p:Triage>Normal [20:14:22] Analytics, Datasets-General-or-Unknown: UploadWizard dataset is empty, limn has no data - https://phabricator.wikimedia.org/T112851#2534831 (Nemo_bis) That dashboard seems to contain totally different stats; "decline" might be a more appropriate description. [20:33:45] milimetric, mforns : did you agree on a js formatter? 
if not we could use the same one used on aqs [20:33:53] mforns, milimetric : for dashiki [20:34:17] nuria_, we did not talk about it, I'm fine with the one in aqs [20:34:29] mforns: k, we can talk about it tomorrow [20:34:33] ok [20:34:57] nuria_, we could use one also for scala and the edit history [20:35:11] mforns: ya, but there i woudl use intelij [20:35:20] aha [20:35:32] mforns: and modify default if needed (unlikely) [20:35:40] mforns: are you guys using another IDE? [20:36:00] joseph uses intelij, I don't use any [20:36:08] dan uses vim [20:36:37] mforns: I also use intelij so i vote for that being the standard [20:36:44] mforns: we can talk about it tomorrow too [20:36:58] yes, that's what I was thinking :] [20:41:36] nuria_: I don't see a reason to declare a standard IDE, but if you'd like a standard format we can use an external program. For js, the easiest to configure is eslint [20:41:41] and I'm sure we can find one for scala [20:42:00] (btw, irccloud didn't ping me, I'm gonna take a look at this matrix thing that Yuvi was praising) [20:42:24] milimetric: w/o agreeimg on a format is hard to do code changes cause you end up changing formatting w/o intending to as everyone has a different format [20:42:48] nuria_: I'm not opposed to standard formatting, just a standard IDE, those are separate things [20:43:33] and nuria_ it's exactly because IDEs change formatting automatically that I hate using them. A code editor should never do that [20:43:48] milimetric: sure, the idea is to share formatting , IDE not needed of course [20:44:23] milimetric: for js it will be great not to have two formatters , aqs alredy uses jshint i think [20:44:25] yeah, eslint is stand-alone and I'm sure it's supported in IDEA and other tools like sublime [20:44:29] nuria_, agree with milimetric [20:44:58] jshint is cool, but I had some problems with it on dashiki. I'm ok with switching back to that, I think it was hard to configure exceptions, but we can figure it out [20:46:44] milimetric: ya, i think aqs uses jshint at running test time: https://github.com/wikimedia/analytics-aqs/blob/master/.jshintrc [20:46:49] cc mforns [20:47:53] yeah, but we have it in our own repository, no reason to stick to that if we don't want [20:48:28] milimetric: dashiki also uses jshint as part of gulp [20:48:58] mforns, milimetric : https://github.com/wikimedia/analytics-dashiki/blob/master/gulpfile.js#L20 [20:49:05] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2534934 (BBlack) >>! In T128132#2534760, @Danielsberger wrote: > 3.5% of requests had zero response size, which, I assume, are aborted requests. That seems like a reasonable number to me.... [20:49:21] so easiest is probably using jshint? cc milimetric , mforns [20:49:25] aha [20:50:04] (PS1) Nuria: [WIP] Limit date range on datepicker [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165) [20:50:30] let's use jshint [20:50:52] I'm still a little skeptical about auto-formatting, but I'm ok with jshint as a standard [20:51:21] (it misses a few things that I value eslint for catching, but I can always use both) [20:51:53] (so that's the thing - I can make code that passes both but would be auto-formatted different by an automatic formatter) [20:55:45] mforns, milimetric : ok, jshint it is. I * think* that i have configured that as the default in sublime but need to double check. [20:56:26] ok