[02:41:35] Nemo_bis: the discovery team takes care of that dashboard, so I'm not sure what direction to point you in [05:39:44] It's ok, just linking for curiosity [06:03:44] goood morning o/ [06:03:59] joal: I can see that aqs1004-b has finally finished compacting [06:04:11] so I am going to re-image aqs1005 as soon as I get into the office [06:20:15] elukey: o/ [06:20:39] Let's discuss before moving forward, still something weird in there (data size on 1004-b [06:33:48] jurbanik: As nuria pointed out, change has been announced some time back (https://www.mail-archive.com/analytics@lists.wikimedia.org/msg03594.html - We even waited more than planned) [06:34:00] jurbanik: And old data will not be removed [06:37:44] elukey: found the issue on aqs1004-b : Problem of code manually updated [06:37:59] elukey: CPU usage is not cassandra but restbase failing to start [06:45:16] Analytics-Kanban: Stop generating pagecounts-raw and pagecounts-all-sites - https://phabricator.wikimedia.org/T130656#2531978 (JAllemandou) Everything looks good, email sent. [06:48:00] joal: you are right, it was stopped because of a scap deployment issue but then we the releng team fixed it restbase tried to start itself [06:48:29] elukey: hm, didn't get it [06:48:38] yeah sorry [06:49:31] puppet was failing due to a scap deployment issue, therefore aqs wasn't turned up. But when Tyler fixed it for the refinery he triggered a deployment of the scap package everywhere [06:49:44] that probably fixed the puppet issue on aqs1004 triggering restbase restarts [06:50:22] elukey: right [06:50:38] elukey: How do we hande that ? [06:50:49] elukey: stop restbase, or change codebase and restart restbase? [06:54:22] joal: I still don't know how to fix the aqs100[456] restbase issue, I remember that we had a chat about it.. If there is a workaround let me know, we can even disable puppet for a while [06:55:11] elukey: first, one confirmation: restbase is down on aqs1004-a, correct? [06:55:50] elukey: issue comes from code diff from the version we deploy using scap and the version needed by restbase on new cluster [06:58:39] yes it is disabled, but I needed to disable puppet too otherwise it would try to bring aqs up again [07:00:16] elukey: we can update code manually and have restbase starte [07:03:23] super [07:04:16] elukey: we acn take eample from aqs1005 ;) [07:05:11] sorry to ask again but is there a way to fix this issue in a way that deploying to aqs1004 fixes the issue without manual code changes? [07:05:32] it looks weird to me that we have to live hack hosts to make everything working [07:05:47] elukey: two clusters, two code bases, one repo ... I guess your answer is no :) [07:05:50] I know that we have differences with aqs100[123] but there might be a way to have both settings? [07:06:04] all right, answer before the question [07:06:08] I wanted too much magic [07:06:10] :) [07:06:10] :) [07:06:37] elukey: feasible - duplicate code for article data, then rename after (but rename is a pain in cassandra) [07:06:56] so, easier to actually manually change for loading time I think [07:07:25] can you do it now on aqs1004 or do you need extra permissions? 
[07:07:31] I don't recall what we did last time [07:07:53] elukey: let's check aqs1005 [07:08:47] elukey: in /srv/deployment/analytics/aqs/deploy/src, git diff HEAD [07:09:13] ahh now I recall [07:09:19] :) [07:09:21] you made me remove that bit of code [07:09:37] all right taking some notes so I'll be able to do it for aqs100[56] [07:10:08] elukey: only needed for aqs1004 for now, but for after restart, needed :) [07:10:56] yes yes [07:13:01] 09:12 RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy [07:14:55] great :) [07:15:17] super weird, aqs1004-b is 350GiB rather than 280ish as the other instances [07:15:23] elukey: indeed [07:15:29] elukey: couldn't it be log related? [07:16:00] joal: log related to aqs or cassandra? [07:17:12] elukey: don't know ... I'd say restbase, but normally everything is sent to logstash ... [07:17:42] yeah [07:17:44] root@aqs1004:/srv/cassandra-b/data# du -hs [07:17:44] 360G . [07:18:37] more precisely [07:18:39] 360G ./local_group_default_T_pageviews_per_article_flat [07:18:39] 360G ./local_group_default_T_pageviews_per_article_flat/data-95eadc90484111e6875bd593012fb45b [07:21:44] elukey: :( [07:23:18] joal: SSTable Compression Ratio: 0.24123920771965465 [07:23:31] this is nodetool-b tablestats [07:24:04] sorry maybe that was the wrong table [07:24:08] checking again [07:24:16] but could it be the compression settings? [07:25:44] I am checking diffs between local_group_default_T_pageviews_per_article_flat on a and b [07:26:20] elukey: number of keys differs a lot [07:26:36] meaning -b has a lot more data than -a [07:26:41] This is a bit weird to me [07:27:38] I was about to say the same [07:27:50] Number of keys (estimate): 346103704 and Number of keys (estimate): 679768001 [07:28:43] I am wondering if bootstrapping two instances at the time could bring these weird side effects [07:28:54] but theoretically the answer should be no [07:29:41] elukey: Don't know what to think :( [07:33:49] elukey: let's move forward and maybe ask urandom, or try reinstanciate aqs1004-b? [07:36:16] joal: I think that the procedure that we are following may have messed up the token distribution.. I'd like to spend a bit of time in understanding why aqs1004-b has more keys, even if nodetool status does not show anything unbalanced.. If we don't find any good solution it might be safer and quicker to just nuke all the nodes and restart from a clean start [07:36:47] elukey: ok [07:36:52] I am afraid that if we don't know exactly what we are doing we might get huge issues in the future when we switch [07:38:49] it might be that aqs1004-b has a lot of "unused" waiting to be cleaned? [07:39:02] (I am reasoning out loud, might be all garbage) [07:39:34] elukey: hwat doesn unused mean? [07:43:42] I am wondering if bootstrapping two instances might have confused how the tokens were assigned, ending up (for some reason) to a temporary assignment of more tokens to aqs1004-b. 
Eventually the ring balanced and aqs1004-b didn't delete tokens that it is not using [07:43:58] a bit of a stretch but I am not sure how cassandra behaves in these situations [07:44:12] elukey: one way to know that is call a repair on aqs1005-b [07:44:20] elukey: aqs1004-b sorry [07:47:09] yeah I was trying to read what is the best way to solve these situations [07:47:19] ahhh cassandra, sometimes I hate you a bit [07:48:31] I started nodetool-b cleanup [07:49:18] k elukey, we'll monitor :) [07:50:08] yesssssss [07:50:12] it is deleting a lot [07:50:33] or at least the trend seems good [07:50:36] :) [07:50:36] fingers crossed [08:20:40] -30GB for the moment, it looks very promising.. if this turns up to be the solution I'd try to bootstrap only one instance at the time [08:22:57] elukey: k [08:44:06] joal: I am going to re-image aqs1005, I think that we are good to go now [08:44:30] elukey: clening done? [08:44:46] almost, but the node is up when the cleanup happens [08:44:53] elukey: okey [08:45:11] I'll wait another half an hour just in case [08:45:15] then I'll re-image.. [08:45:22] elukey: as you wish :) [08:45:45] it looks good now, aqs1004-b is not 294GiB [08:45:48] much better :) [08:45:54] *now [08:46:01] elukey: possible reason for the thing is aqs1005-b receiving both data for a and b, but the weird thing is that it looks that a didn't receive data for b :( [08:47:32] yeah it seems that the ring-token-mess happened only for the b instance, not sure why.. it explains also why it took so much time [08:47:43] yes [08:47:45] weird [08:49:46] joal: totally unrelated but it might interest you - https://github.com/wikimedia/mediawiki/blob/d5c40c584a581f1dc6a184617d6bfd1629466bc9/includes/objectcache/ObjectCache.php [08:49:48] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2532122 (jcrespo) @Jdforrester-WMF It seems nobody complained, ok to irreversibly delete these tables? [08:50:18] it should be the various abstractions around mediawiki caches [08:50:30] (APC, mysql and memcached) [08:50:41] elukey: I can't read php, my eyes bleed too much [08:50:52] :D [08:52:13] I have the same problem! But it is fashinating [08:52:23] the abstraction looks really nice [08:52:28] elukey: why is it? [08:53:29] I am not sure how well this class hides low level details about where an object that you want to store goes [08:54:22] but from a quick view it seems that mw developers calls this class and specify the "kind" of cache that they need (local/fast on host, persisted on the same DC, replicated, etc..) 
[08:54:38] elukey: nice :) [08:54:56] and the object then ends up in mysql, memcached or APC [08:55:02] transparently afaiu [08:55:04] very nice [09:47:10] aqs1004-b's deletions seems not stopping :D [09:47:16] now 235GiB [09:51:31] it seems going into each db file and clean up unused keys [09:51:34] one by one [09:51:49] (files not keys) [10:16:14] ok I am bootstrapping cassandra-a on aqs1005 [10:16:42] but cassandra-b's sstables on aqs1004 are still deleting keys [10:16:43] sigh [10:31:05] joal: I am going to leave things going for today, but I suspect that the best path forward (and the quickest) is to wipe everything [10:31:11] and restart loading from scratch [10:31:17] we are loosing too much time with this [10:31:23] even if we are learning a lot of things [10:31:54] elukey: I'd rather not change strategy every week ) [10:33:14] joal: yes but the agreement was that if the time to re-load data was less than realoading we'd have continued [10:33:21] this is not happening :) [10:33:51] plus the cleanup looks weird, the sstable dropped below 200GiB [10:34:40] elukey: your call, reloading is not an issue :) [10:37:33] joal: I don't mind to take the call but it was more an open question rather than a statement [10:37:39] :) [10:37:47] I am following what we discussed last week, that's it [10:38:22] Now that we have invested some time into trying to bootstrap, I think the learning we are having is interesting [10:38:46] elukey: Imagine adding a few nodes while cluster in production and having those kind of issues [10:40:40] so Eric already discouraged what I am doing to bootstrap instances with live traffic, because he said that it is safer to decom an instance (so data shuffling + cluster rebalance) to then re-add it in the ring. Adding a new instance only is easier and it shouldn't cause this amount of pain :) [10:41:23] what I'd like to do it is balance results with knowledge [10:41:37] otherwise the new AQS cluster will go nowhere :( [10:42:39] nodetool-b cleanup finished on aqs1004 [10:42:43] good news [10:42:56] 173G ahhahaha [10:43:01] joal: --^ [10:43:12] elukey: interesting info is maybe doing regular cleaning can gain us some space :) [10:43:46] I cleaned all the instances and this one is the only one that dropped so heavily :/ [10:44:41] elukey: you launched clean on every instance this morning? 
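For reference on the cleanup discussed above: `nodetool cleanup` rewrites SSTables and drops data belonging to token ranges the instance no longer owns, which would explain why aqs1004-b shrank so much once the ranges it picked up during the double bootstrap were cleaned out. A minimal sketch of running cleanup per instance and measuring the space reclaimed — the `nodetool-a`/`nodetool-b` wrappers and `/srv/cassandra-*` data paths are the ones mentioned in the conversation; everything else is illustrative, not the team's actual tooling:

```python
#!/usr/bin/env python
# Illustrative sketch: run `nodetool cleanup` on each local Cassandra instance,
# one at a time, and report how much disk space each run freed.
import subprocess

INSTANCES = {
    'a': {'nodetool': 'nodetool-a', 'data_dir': '/srv/cassandra-a/data'},
    'b': {'nodetool': 'nodetool-b', 'data_dir': '/srv/cassandra-b/data'},
}

def data_size_kib(path):
    """Disk usage of a data directory in KiB, via `du -sk`."""
    out = subprocess.check_output(['du', '-sk', path])
    return int(out.split()[0])

for name, inst in INSTANCES.items():
    before = data_size_kib(inst['data_dir'])
    # cleanup drops SSTable data for token ranges this instance no longer owns;
    # it is safe but IO-heavy, so run instances sequentially, never in parallel.
    subprocess.check_call([inst['nodetool'], 'cleanup'])
    after = data_size_kib(inst['data_dir'])
    print('instance %s: %.1f GiB -> %.1f GiB (freed %.1f GiB)' % (
        name, before / 1048576.0, after / 1048576.0, (before - after) / 1048576.0))
```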
[10:45:06] one at a time, before starting aqs1004-b [10:45:14] but they completed very quickly [10:45:28] elukey: a repair might be good for aqs1004-b I think [10:45:53] I think my brain will need a repair if I keep going in this way :P [10:46:37] elukey: and now that the bootstrap of 5-a has started, I think it's wiser to actually stop and reload --> we might end up with data corruption, since a node is down (5-a), and one is possibly corrupted [10:47:07] elukey: going with the process we used, we should have: [10:47:24] - bootstrapped instances one by one [10:47:36] - if issues, solved the issue before bootstrapping again [10:48:08] multiple actions involving data reconstruction at the same time seem not to work for cassandra [10:48:20] yes it makes sense [10:49:56] so I have stopped aqs1005-a and now I am going to run nodetool-a repair [10:50:16] sorry nodetool-b repair on aqs1004 [10:50:24] reading from the docs it should be enough [10:50:36] elukey: wrong idea - if 4-b and 5-a share some data, there are chances we'll corrupt data [10:50:57] cause there is only one source, so no possibility of quorum [10:52:17] 5-a is down and doesn't have any data, how can it possibly corrupt anything? [10:52:29] 4-b will repair using the other replica as well [10:52:48] or maybe I am not following [10:53:26] elukey: replication factor is 3, so if 4-b and 5-a shared some data, now that 5-a is down, only one other instance has the same data - how do we know which one (of 4-b and the other) has the correct copy? [10:57:28] 4-b is probably missing some data, the other node should be able to provide the missing pieces [10:57:33] elukey: in other words, there is a possibility of not being able to reach quorum (given that 5-a is down) [10:57:34] this is my assumption [10:58:22] what is your suggestion? [10:58:35] reload [10:58:50] ah ok, so completely wipe [10:59:09] having 2 instances possibly sharing corrupted data doesn't allow us to repair [10:59:37] could lead to inconsistencies yes [10:59:57] elukey: in order to repair, we should have put 5-a down by downgrading [11:00:19] elukey: this way, cassandra would have re-sent data in order to reach quorum 3 for the bits of data 5-a owned [11:00:52] but currently, there are only 2 instances holding the data that has been wiped from 5-a [11:01:13] mmm probably I would have needed to wait for the cleanup, and leave 5-a up. I thought that cleanup was completing and that it would have been a safe op, but I was wrong [11:01:31] if possibly-corrupted 4-b is one of them, then cassandra can't decide which instance owns the truth [11:02:01] elukey: it's ok, learning :) [11:02:22] I'd need to check the repair techniques first because there might be a chance that Cassandra knows what to do, but I am a bit ignorant so I'll follow your advice [11:02:22] elukey: and now we have a good reason to fully wipe and reload ;) [11:02:27] yeah [11:02:28] :) [11:02:42] ok proceeding, you should be able to load again soon [11:02:52] elukey: maybe you can tell cassandra that a node is possibly corrupted [11:02:55] elukey: WAIT [11:03:08] elukey: let's read and see if we can figure out a way of repairing [11:04:13] I was reading the read-repair docs :) [11:04:53] precisely http://www.datastax.com/dev/blog/advanced-repair-techniques [11:08:53] but even if we have a way to hint, would we trust the data stored/repaired? [11:09:12] I actually don't know [11:10:26] thinking out loud: the cleanup has reviewed all the db files and deleted keys that it thought were not needed by the instance.
The other replica should be able to correct this when they compare their status [11:10:33] elukey: every post I read about getting back consistency uses cassandra backup/restore [11:11:43] elukey: but we don't have enough space to enable that [11:11:53] maybe we can try the repair and see how it goes, as a learning experiment. I am not sure that we'll be able to trust the current data [11:12:07] so with possibility of quorum being lost and corrupted dzata associated, I think reload is the only way [11:12:48] elukey: completely feasible [11:12:53] all right, can I proceed? [11:12:58] please :) [11:13:29] elukey: I think it'll be interesting to follow what repair says, and try to understand if it fails for instance due to not being able to reach quorum [11:14:03] elukey: I could take my break now, is that ok for you? [11:14:23] Did not get positive replies from all endpoints. List of failed endpoint(s): [11:14:54] so two instances down are a serious problem [11:15:01] elukey: for sure it is [11:15:18] wow I didn't think it would have been so bad [11:15:22] elukey: but it's different ahving 2 instances down and having 2 instances corrupted ! [11:15:41] sure sure, byzantine failures are always a sneaky issue :D [11:15:52] ok so joal I am going to re-image 1006 and wipe all [11:15:57] confirm? [11:16:15] elukey: 2 instances down - you can't write, and if at least one startup without corruption, things are ok [11:16:28] cause you can repair the corrupted one [11:16:41] yep yep [11:16:44] 2 instances corrupted - WRONGGGGGG [11:16:49] DDIIIING [11:16:50] Please wipe it all :) [11:16:53] game over [11:16:54] :P [11:16:57] DING DING SING [11:16:58] all right, have a good break [11:16:59] Indeed [11:16:59] ahahhahaha [11:17:20] elukey: By chance we can reload reasonably easily [11:17:35] Taking my break, will start loading when I get back I guess ;) [11:17:46] elukey: You should actually wipe it all, right ;) [11:18:03] 1004 as well [11:18:06] starting afresh [11:18:25] sure sure [11:22:17] going to eat something while 1006 is reimagining [11:58:57] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2532785 (Aklapper) p:Triage>Normal [12:33:51] joal: cluster up and running [12:34:05] I think we'd need to add fake data [12:34:15] (to pass the health checks) [12:34:22] but everything is up and running [12:35:13] I am going to add a summary of what has happened to the phab task [12:35:17] and I'll inform the team [12:36:29] (the raid10 arrays on aqs100[45] are re-syncing since I wiped all the data) [12:59:19] Analytics-Kanban: Replace RAID0 arrays with RAID10 on aqs100[456] - https://phabricator.wikimedia.org/T142075#2532948 (elukey) TL;DR: -------------- Today we decided to finish the hosts reimage and wipe the whole cluster to avoid possible data corruption issues. Long explanation: -------------------- Sta... [12:59:35] hope that the explanation is good enough [12:59:37] :) [13:14:59] a-team going afk, just sent an email o/ [13:16:02] Hey joal & milimetric [13:16:22] I'd like to push the live systems meeting back 1 week and then continue on our biweekly cadence. OK? [13:16:24] hey halfak [13:16:34] ok with me [13:22:58] Great. Thanks milimetric. joal? [13:55:20] milimetric: have you seen the wikitech-l convo "Loosing the history of our projects to bitrot"? [13:55:32] hey halfak, ok with me as well :) [13:55:42] Cool. moving! 
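To make the morning's replication-factor argument concrete, here is the arithmetic behind "2 instances corrupted - WRONGGGGGG" as a tiny illustrative snippet (just numbers, not Cassandra API): with RF=3 a quorum needs 2 replicas, and with aqs1005-a down and aqs1004-b untrusted only one trusted replica can remain for the ranges those two shared, so no majority exists and a full reload is the safe option.

```python
# Illustrative quorum arithmetic for the situation described above.
replication_factor = 3
quorum = replication_factor // 2 + 1        # 2 replicas must agree

down_instances = 1       # aqs1005-a: stopped and wiped mid-bootstrap
untrusted_instances = 1  # aqs1004-b: data possibly inconsistent after the cleanup
trusted_remaining = replication_factor - down_instances - untrusted_instances

print('quorum needed        :', quorum)             # 2
print('trusted replicas left:', trusted_remaining)  # 1
# 1 < 2: for token ranges replicated to both affected instances there is no
# majority of trusted copies, so repair cannot arbitrate -> wipe and reload.
```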
[13:56:02] ottomata: no, sounds very relevant, looking up [13:56:57] cool ja, i think its mostly about content, but was thinking you might want to chime in about future of edit history stuff [14:06:12] !log Updating cassandra compaction to deflate on newly wiped cluster [14:06:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:09:59] !log Adding test data onto newly wiped aqs cluster [14:10:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:10:11] hey yall, am planning on doing an analytics eventlogging deploy this morning [14:10:14] a-team ^ [14:10:19] lemme know if there are any objections [14:10:41] cool with me [14:11:07] ottomata: no problemo [14:13:21] !log Loading 2016-06 in clean new aqs [14:13:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:30:33] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/303340 (owner: Milimetric) [14:35:17] Analytics-EventLogging, DBA, ImageMetrics: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407#2533133 (Jdforrester-WMF) >>! In T141407#2532122, @jcrespo wrote: > @Jdforrester-WMF It seems nobody complained, ok to irreversibly d... [14:44:49] hey milimetric [14:46:45] hey joal [14:46:47] what's up [14:47:07] milimetric: looking at scaling history rebuild using your sqoops [14:47:32] milimetric: I'll have some questions / ideas when you'll have time [14:48:01] omg joal I just accidentally deleted the directory I put them :( [14:48:09] there's no undoing rm -r is there? [14:48:14] omg, so stupid... [14:48:20] hm hm, I don't think there is [14:48:22] joal: batcave? [14:48:24] sure [15:12:21] Analytics-Kanban: User history: Fix the oldUserName and newUserName in blocks/groups log events - https://phabricator.wikimedia.org/T141773#2533402 (Nuria) Open>Resolved [15:13:16] !log deploying eventlogging/analytics - kafka-python 1.3.0 for both consumers and producers [15:13:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [15:13:27] (PS14) Joal: [WIP] Refactor Mediawiki History scala code [analytics/refinery/source] - https://gerrit.wikimedia.org/r/301837 (https://phabricator.wikimedia.org/T141548) [15:15:44] joal: ok, cave again? [15:16:04] yup milimetric, OMW ! [15:18:24] milimetric, joal, are you guys going to dicuss edit history? [15:18:46] come on in, mforns [15:18:50] ok [15:20:22] Analytics-Cluster, Analytics-Kanban, Deployment-Systems, scap, and 2 others: Deploy analytics-refinery with scap3 - https://phabricator.wikimedia.org/T129151#2533451 (Nuria) Open>Resolved [15:21:07] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533453 (BBlack) So, before seeing this ticket I hadn't been looking at the URL/hostname patterns of these requests. Now that I am: In the US, we're seeing the... 
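The "!log Updating cassandra compaction to deflate" entry above presumably refers to table compression (joal later says "I had to change the compression"). A hedged sketch of what such a change could look like with the DataStax Python driver — the keyspace name is inferred from the data directory seen earlier in the day, and the table name "data" plus the pre-3.0 'sstable_compression' option name are assumptions, not confirmed details:

```python
# Illustrative only: switch the per-article table to DeflateCompressor.
# Keyspace inferred from .../local_group_default_T_pageviews_per_article_flat;
# table name and option name are assumptions for a pre-3.0 Cassandra.
from cassandra.cluster import Cluster

session = Cluster(['localhost']).connect()
session.execute(
    'ALTER TABLE "local_group_default_T_pageviews_per_article_flat".data '
    "WITH compression = {'sstable_compression': 'DeflateCompressor'}"
)
# Only newly written SSTables use the new compressor; existing ones are
# rewritten on compaction or via `nodetool upgradesstables`.
```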
[15:48:57] Analytics-Dashiki: Mediawiki storage package shoudl request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533536 (Nuria) [15:49:15] Analytics, Analytics-Dashiki: Mediawiki storage package should request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533548 (Nuria) [15:56:55] milimetric: can you chimein on the comment here [15:56:56] https://gerrit.wikimedia.org/r/#/c/301284/14/jsonschema/mediawiki/page/delete/1.yaml [15:57:04] about required user_groups? [15:57:06] sorry [15:57:09] k [15:57:12] required performer.is_bot [15:57:22] oh [15:57:24] acutally [15:57:25] actually [15:57:34] naw, i think i need you more on the rev_count field comment [16:00:29] mforns: stadduppp [16:35:08] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533764 (Nuria) @Amire80 : what problem are these spikes causing you? The spikes represent real traffic, not per se "user initiated requests" [16:40:43] Analytics, Analytics-Dashiki: Mediawiki storage package should request files with 1 hour ttl - https://phabricator.wikimedia.org/T142395#2533536 (Milimetric) p:Triage>Normal [16:44:34] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533804 (Amire80) >>! In T141506#2533764, @Nuria wrote: > @Amire80 : what problem are these spikes causing you? The spikes represent real traffic, not per se "u... [16:46:21] Analytics, Analytics-Dashiki: Improve initial load performance for dashiki dashboards - https://phabricator.wikimedia.org/T142395#2533834 (Milimetric) [16:47:32] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533839 (BBlack) If we're looking to reduce impact on global statistics interpretation, simply filtering out all requests which have a User-Agent string containi... [16:47:57] Analytics, MediaWiki-API, RESTBase-API, Services: Top API user agents stats - https://phabricator.wikimedia.org/T142139#2524050 (Milimetric) p:Triage>Normal [16:49:06] Analytics, Pageviews-API: Pageview API Capacity Projections when it comes to storage - https://phabricator.wikimedia.org/T141789#2533851 (Milimetric) p:Triage>Normal [16:49:37] Analytics, Operations, Ops-Access-Requests: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2522728 (Ottomata) This was talked about in ops meeting today. Ops would prefer that we create another group, `deploy-aqs` perhaps,... [16:50:56] Analytics, Operations, Ops-Access-Requests: Add analytics team members to group aqs-admins to be able to deploy pageview APi - https://phabricator.wikimedia.org/T142101#2533860 (Ottomata) Hm, there is already an `aqs-users` group. Should we reuse it for this? [16:50:59] Analytics, Monitoring, Operations: Switch jmxtrans from statsd to graphite line protocol - https://phabricator.wikimedia.org/T73322#772448 (Milimetric) @elukey we put this on Q2, let me know if it should be earlier. 
[16:52:15] Analytics-Kanban, Spike: Spike - Slowly Changing Dimensions on Druid - https://phabricator.wikimedia.org/T134792#2533865 (Milimetric) p:Triage>Normal [16:52:19] Analytics-Kanban: Browser dashboard blogpost - https://phabricator.wikimedia.org/T141267#2533866 (Milimetric) p:Triage>Normal [16:52:30] Analytics-Kanban: Continue New AQS Loading - https://phabricator.wikimedia.org/T140866#2533867 (Milimetric) p:Triage>Normal [16:52:35] Analytics-Kanban: Scale scala algorithms using graph partitioning - https://phabricator.wikimedia.org/T141548#2533868 (Milimetric) p:Triage>Normal [16:52:39] Analytics-Kanban: Productionize edit history extraction for all wikis using Sqoop - https://phabricator.wikimedia.org/T141476#2533869 (Milimetric) p:Triage>Normal [16:52:47] Analytics-Kanban: User history: Adapt the user history reconstruction to use scaling by clustering - https://phabricator.wikimedia.org/T141774#2533870 (Milimetric) p:Triage>Normal [16:52:54] Analytics-EventLogging, Analytics-Kanban, EventBus, Patch-For-Review: Change or upgrade eventlogging kafka client used for producing - https://phabricator.wikimedia.org/T141285#2533871 (Milimetric) p:Triage>Normal [16:52:59] Analytics-Kanban: EventBus Maintenace: Fork child processes before adding writers - https://phabricator.wikimedia.org/T141470#2533872 (Milimetric) p:Triage>Normal [16:53:04] Analytics-Kanban, EventBus, Services, User-mobrovac: Improve schema update process on EventBus production instance - https://phabricator.wikimedia.org/T140870#2533873 (Milimetric) p:Triage>Normal [16:54:06] Analytics: Pageview API demo doesn't list be-tarask - https://phabricator.wikimedia.org/T119291#2533878 (Milimetric) Open>Resolved a:Milimetric looks like it works in the new tool: http://tools.wmflabs.org/pageviews/?project=be-tarask.wikipedia.org&platform=all-access&agent=user&range=latest-20&p... [16:55:31] Analytics, Security-Reviews: Security review of Analytics Query Service - https://phabricator.wikimedia.org/T114918#2533896 (Milimetric) Open>Resolved a:Milimetric @dpatrick if you think this is still needed, please reopen or ping us. The service has been up for a few months and we reviewed... [16:57:19] Analytics, Datasets-General-or-Unknown: UploadWizard dataset is empty, limn has no data - https://phabricator.wikimedia.org/T112851#2533911 (Milimetric) Open>Invalid no longer relevant, metrics available in new dashboard: https://edit-analysis.wmflabs.org/multimedia-health/ [16:58:19] joal: o/ [16:59:04] elukey: \o [16:59:35] Analytics: hourly pageview dumps can contain empty title - https://phabricator.wikimedia.org/T90629#1063642 (Milimetric) We no longer maintain these datasets, please take a look at the new pageviews dataset: https://dumps.wikimedia.org/other/analytics/ and specifically https://dumps.wikimedia.org/other/pagev... [16:59:48] Analytics: hourly pageview dumps can contain empty title - https://phabricator.wikimedia.org/T90629#2533942 (Milimetric) Open>declined [17:00:26] joal: all good with the new cluster? 
Not sure if you had the chance to double check [17:00:55] elukey: Data is loading, everything ok so far :) [17:01:21] \o/ [17:01:22] elukey: I had to change the compression, and added the test data [17:01:49] elukey: only weird thing is that some test files are not present (in order to run icinga test) [17:01:50] I didn't remember if you put them somewhere or if they were on a ticket [17:02:29] (the instructions about compression and test data) [17:02:31] elukey: They should be here by default [17:03:02] Analytics: Better publishing of Annotations about Data Issues - https://phabricator.wikimedia.org/T142408#2533952 (Milimetric) [17:03:04] ah, test data: /srv/deployment/analytics/aqs/deploy/scripts [17:04:02] mmm does scap pull the code when puppet installs/configure it for the first time? [17:04:20] probably yes otherwise I don't explain how I could have modified pageviews.js :D [17:04:37] it should on a clean install, yes [17:05:19] thanks bd808 :) [17:07:25] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2533974 (Milimetric) @Amire80 so we could try to clean up this data in our pageview data pipeline, but it would be a *lot* of effort. Also, I'm not at all sure... [17:07:42] joal: tomorrow would you mind to refresh to me how to load data to cassandra? I know that the answer is oozie :) [17:07:52] no problem elukey :) [17:08:05] thanks! [17:08:20] all right all is proceeding well, I wanted to double check [17:08:37] Analytics, Operations, Traffic: Correct cache_status field on webrequest dataset - https://phabricator.wikimedia.org/T142410#2533998 (Nuria) [17:08:40] joal: last step is to change the user security settings, we might do it after one or two loads [17:08:52] in the meantime, I'll prepare the puppet changes [17:09:20] sure [17:12:59] milimetric: Heya [17:18:19] milimetric: I moved enwiki data from your folder to mine (using partition structure) to test with mforns [17:18:34] milimetric: Let me know if it causes any issue [17:20:14] all right logging off (again), have a good day/evening people :) [17:24:30] elukey: have a good evening ! Tomorrow [18:20:12] Analytics, Analytics-Cluster: Create new analytics cluster / hadoop role for mediawiki vagrant - https://phabricator.wikimedia.org/T115707#2534354 (Ottomata) [18:23:32] nuria_: , want to do the kafka broker bounce soon [18:23:38] ottomata: sure [18:23:38] yt? want to watch logs with me? [18:24:00] ottomata: sure, i am at library but hopefully i can do video, let me get headset [18:24:14] ottomata: shoudl we tail them on EL box or eventbus box? [18:24:33] *should [18:24:33] ah let's just to IRC [18:24:34] its ok [18:24:39] EL box [18:24:42] ok [18:24:43] this is just for analytics eventlogging [18:26:10] ottomata: on /var/log/upstart? [18:26:13] up [18:26:14] yup [18:26:25] i'm looking at /var/log/upstart/eventlogging_processor-client-side-00.log and /var/log/upstart/eventlogging_consumer-mysql-m4-master-00.log [18:26:29] just to reduce noise [18:26:30] but any [18:26:35] eventlogging_*.log should be active [18:27:05] i'm also tailing the log files in /srv/log/eventlogging [18:27:11] and just watching the msgs/sec of those [18:27:14] with pv -l > /dev/null [18:28:55] ottomata: forgot about pv! [18:29:29] k nuria_ ready? 
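On the `pv -l > /dev/null` trick mentioned just above: piping a tailed log through pv with -l simply shows how many lines per second flow by, i.e. messages per second. A rough, purely illustrative Python equivalent — the default path is a placeholder, not a real filename:

```python
#!/usr/bin/env python
# Rough equivalent of `tail -f <log> | pv -l > /dev/null`: follow a log file
# and print how many new lines (messages) arrive per second.
import sys
import time

path = sys.argv[1] if len(sys.argv) > 1 else '/srv/log/eventlogging/example.log'

with open(path) as f:
    f.seek(0, 2)                 # start at end of file, like tail -f
    count, last = 0, time.time()
    while True:
        line = f.readline()
        if line:
            count += 1
        else:
            time.sleep(0.05)     # nothing new yet, back off briefly
        now = time.time()
        if now - last >= 1:
            print('%d msgs/sec' % count)
            count, last = 0, now
```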
[18:29:39] i'm going to stop kafka broker on 1013 [18:29:44] k [18:29:44] see what happens [18:29:46] then start it [18:29:48] see what ahpepns [18:29:52] then do a leader election [18:29:54] and see what happens :) [18:29:56] k doing [18:30:16] !log restarting kafka broker on kafka1013 to test eventlogging leader rebalances [18:30:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [18:31:04] interesting. [18:31:33] ottomata: me no see nothing on processor log [18:31:37] no? [18:31:38] i see lots [18:31:49] it looks like it settled [18:31:51] but it took a bit [18:32:58] hm, ok [18:33:00] starting 1013 [18:34:16] ok nuria_ doing an election [18:34:27] ottomata: to verify , is this: deployment-eventlogging03 [18:34:30] nono [18:34:33] this is in prod [18:34:35] eventlog1001 [18:35:24] ottomata: ahahaha [18:35:30] :) [18:35:37] ok, good news: [18:35:42] nothing died and everything kept working [18:35:43] WOOHOO [18:35:48] bad news: i think i see some dropped messages. [18:35:53] it ran out of retries. [18:36:06] metadata changes weren't propagated fast enough, i guess. [18:36:07] hm. [18:36:13] we could up the number of retries for analytics [18:36:20] not a bad idea since this is totally async and auto commit [18:36:26] not sure it would help [18:36:30] but it wouldn't hurt [18:36:49] nuria_: don't worry, you'll get a second chance to watch [18:36:51] i think we should do it again [18:36:58] i'll restart eventlogging with a manual increase of retries [18:37:03] in prod [18:37:05] ja? [18:37:29] ottomata:ok, now that i am looking ta teh right logs [18:37:32] *at the [18:37:40] hm [18:37:46] and/or i could increase retry_backoff_ms [18:37:49] its default is 100 [18:40:56] ottomata: what does retry_backoff do? [18:41:58] An artificial delay time to retry the [18:41:58] produce request upon receiving an error. This avoids exhausting [18:41:59] all retries in a short period of time. Default: 100 [18:42:21] ottomata: ah , cause now retries are just "attemps?" [18:42:31] or rather "number of attemps?" [18:42:34] yes [18:42:39] it will retry a produce request that many times [18:42:47] when the broker dies, in flight produce requests will return an error [18:42:56] (when the broker dies, OR when leadership changes, its the same) [18:43:06] that triggers the client to request the new partition layout and new leaders [18:43:17] then the produce requests will automatically be retried to the new leaders [18:43:37] but, if the metadata refresh takes longer than the client will keep retrying [18:43:40] the produce requests might fail [18:43:48] right [18:43:53] i'm not sure if the retry backoff is compounding [18:43:56] it look slike its not [18:44:08] it'll just space out retries for the same produce request 100 ms aparts [18:44:23] for analytics increasing that is probably good. increasing retries is also probably good [18:44:39] maybe backoff=200 and retries=6 [18:44:42] for analytics this is good. [18:44:42] ottomata: even increasing times i do not think you will be able to guarantee delivery of messages upon a restart though [18:44:50] i wouldn't do that for eventbus though [18:45:00] nuria_: it should. 
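A sketch of the producer settings discussed above, using kafka-python (the 1.3.0 client deployed earlier in the day) — this is not the actual eventlogging code, and the broker list and topic name are placeholders. The errback makes exhausted-retry failures visible, which addresses the concern raised just above about whether dropped produce requests get logged at all:

```python
# Illustrative producer configuration matching retries=6, retry_backoff_ms=200.
import logging
from kafka import KafkaProducer

log = logging.getLogger('producer')

producer = KafkaProducer(
    bootstrap_servers=['kafka1013:9092'],  # placeholder broker list
    retries=6,              # retry each produce request up to 6 times
    retry_backoff_ms=200,   # wait 200 ms between retries (library default is 100)
    acks=1,                 # async/best-effort, as for analytics eventlogging
)

def on_send_error(exc):
    # Called when a send has exhausted its retries, e.g. after repeated
    # NotLeaderForPartitionError during a leader election.
    log.error('dropped message: %s', exc)

future = producer.send('example-topic', b'{"event": "..."}')  # placeholder topic
future.add_errback(on_send_error)
producer.flush()
```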
[18:45:14] as soon as a broker goes down, or leadership changes (same same) [18:45:27] kafka uses zookeeper to elect a new leader for the relevant partitions [18:45:45] clients are supposed to retry the produce requests [18:45:54] with the new leadership layout after the election has finished [18:46:11] ottomata: ah, i see a new broker will respond [18:46:11] this is how kafka can ensure delivery of messages even if a broker dies [18:46:13] yes [18:46:21] ottomata: it is not the whole cluster, ok [18:46:23] ja [18:46:30] but the client has to do this properly [18:46:33] this is not handled by the brokers [18:46:45] the client just gets an error when it tries to produce [18:46:55] that is what triggers it to ask the kafka cluster for its new metadata layout [18:47:08] then it will reconfigure its produce requests internally [18:47:12] so that they are retried to the correct brokrers [18:47:26] say topicA has partitions 0,1,2 [18:47:33] brokers 00, 01 and 02 [18:47:44] each of whcih is the prefered leader for a partition [18:47:48] if we stop broker 00 [18:48:01] topicA partition 0's leadership will be given to one of the other brokers [18:48:06] say broker 01 [18:48:13] it will try 01 upon getting an error on producing [18:48:21] well, it will get an error [18:48:26] and will get an error from 01 until that is the leader [18:48:27] then ask kafka for the new leadership layout [18:48:33] right [18:48:41] then it will use that info to figure out where messages for partition 0 should go [18:48:48] but, if the new layout takes too long [18:48:55] say the election takes too long or something [18:49:08] where 'too long' is longer than the client is willing to retry [18:49:11] then the client will give up [18:49:20] ottomata: right [18:50:03] so, ja, for async analytics use, let's try increasing both [18:50:10] we are ok with trying for a while [18:50:20] 200 and 6 retries [18:50:38] shouldnt' be more than 1.2 seconds, but realisitcally less than 2 at least [18:50:39] that should be plenty of time for the election [18:54:08] !log restarting eventlogging with processors retries=6&retry_backoff_ms=200. if this works better, will puppetize. [18:54:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [19:00:10] ok nuria_ el restarted with those settings, i'm going to restart kafka1013 broker and do election again [19:00:29] ottomata: ok looking [19:03:27] ottomata: I see connection failed but no election yet [19:03:39] nuria_: i haven't done eleciton, you won't see election messages [19:03:41] you'll see things like [19:03:52] kafka.errors.NotLeaderForPartitionError [19:03:54] ottomata: i sahll see connection to anew broker right? [19:03:58] *shall [19:04:03] not sure if it will log that [19:04:07] ottomata: k [19:04:15] Node 13 connection failed -- refreshing metadata [19:04:17] is good [19:04:43] what i'm not sure about is: retrying (0 attempts left). [19:05:33] hmmmmm [19:05:33] actually [19:05:41] i'm reading code [19:05:54] not so sure it logs when the produce request ultimately fails because not enough retries left? 
[19:05:56] hm [19:06:02] it looks like it just updates some metrics [19:06:02] hm [19:07:22] hm [19:07:27] ok nuria_ not so sure that helped [19:07:32] but i'm going to do election now anyway [19:07:38] this should have fewer messages about disconneced node [19:07:41] sicne 1013 is back up [19:07:45] and this is just a simple leadership change [19:08:54] AHHH nuria_ [19:08:54] i did this all wrong [19:08:55] i only updated the configs for ONE of the processors [19:08:56] doh! [19:08:57] duh [19:09:06] that's why we got to 0 attempts left, we were still using the default settings for most processors [19:09:12] ottomata: ok [19:09:13] ok, going to do this again... [19:09:23] ottomata: right cause processors there are more than 1 [19:09:31] ottomata: i forget, consumers there is only 1 [19:11:05] yeah, that's right [19:11:09] we still want to make mroe mysql consumers [19:11:22] ottomata: right but we need the autoinsert [19:11:26] but we can't cause of that replication issue [19:11:26] aye [19:11:33] ok nuria_ stopping broker again. [19:11:34] ottomata: should we try again? [19:12:19] yes [19:12:22] we trying again now [19:12:49] yeah, rats. [19:13:03] lots of errors when broker goes down. [19:13:03] lots of 0 attempts left messages. [19:13:27] well, i mean, this is better than the process dying :/ [19:13:39] ok, let's get drastic and try again with a long backoff ms [19:13:41] just to see [19:13:45] i'll only change in on processor 00 [19:13:50] and only restart that one [19:14:52] k [19:15:22] logging that one [19:17:05] Hmm actually, nuria_ [19:17:08] going to do it for 01 [19:17:12] yessir [19:17:14] hmmm [19:17:16] hang on [19:17:20] i want to do it for one that has 13 as the leader :p [19:17:24] lemme figure out which one that is [19:17:47] actually, no we'll just restart broker 22 [19:17:47] client 00 is consuming partition 11, which has 22 as leader [19:18:38] PSHHH [19:18:39] still did it. [19:20:53] ottomata: it did not reconnect right? [19:21:34] nuria_: its not about reconnecting [19:21:56] its about having enough time to retry the message to the proper broker [19:21:59] i would think 10 retries and 1 second backoff would be *plenty* [19:22:05] but there are still a few '0 attempts left' message [19:22:18] now, that doesn't mean it would drop, but since this isn't logging the actual failed produce [19:22:27] and since it shouldnt' get to 0 anyway [19:22:32] with 10 retries and 1 second backoff [19:22:44] i am assuming that this is dropping a message [19:22:44] i only see it happening twice for this processor though [19:22:47] which is pretty good, but not perfect [19:22:51] it should be perfect [19:25:42] def better than with pykafka though [19:25:46] so at least its an improvement. [19:26:40] hm, ok welp. [19:27:00] i'm going to puppetize increase in retries and backoff, but not to 10 and 1 second [19:27:02] probably just to 6 and 200 ms [19:27:14] will make a task to investigate this, and then put it off for a while... [19:35:33] Analytics, Pageviews-API, Reading-analysis: Suddenly outrageous higher pageviews for main pages - https://phabricator.wikimedia.org/T141506#2534689 (Nuria) >I'd like to know how many real users with real non-mobile browsers looked at the site to compare this with the number of people who clicked on i... [19:35:52] ottomata: k [19:36:28] milimetric: yt? [19:37:10] milimetric: the sunburst graphs upon loading by default they display data from the very beginning right? 
just like the timeline graph cc mforns [19:37:49] Analytics, Analytics-EventLogging: Ensure no dropped messages in analytics eventlogging processors when stopping broker - https://phabricator.wikimedia.org/T142430#2534697 (Ottomata) [19:42:44] nuria_, I think so [19:43:06] nuria_, yes sure [19:44:03] mforns: ok. Will change date limits and defaults in all then. [19:44:11] ok [19:44:13] makes sense [19:44:54] Analytics-Cluster: Install SparkR on the cluster - https://phabricator.wikimedia.org/T102485#2534727 (Ottomata) a:Ottomata>None [19:45:14] Analytics-Cluster, Operations: Queries in Hue always return an empty result set - https://phabricator.wikimedia.org/T128039#2534730 (Ottomata) a:Ottomata>None [19:45:48] Analytics, Analytics-Cluster: Monitor cluster running out of HEAP space with Icinga - https://phabricator.wikimedia.org/T88640#2534733 (Ottomata) a:Ottomata>None [19:48:15] Analytics, Analytics-Cluster, Improving-access, Research-and-Data-Backlog: Hashed IP addresses in refined webrequest logs - https://phabricator.wikimedia.org/T118595#2534744 (Ottomata) Open>declined We don't keep IPs in eventlogging anymore. It would be good to send request-id with webre... [19:48:56] Analytics-Cluster: Hue shows error from varnish when issuing Hive query - https://phabricator.wikimedia.org/T95953#2534748 (Ottomata) Open>declined [19:49:08] Analytics-Cluster: Write wikitech spark tutorial - https://phabricator.wikimedia.org/T93111#2534749 (Ottomata) a:Ottomata>None [19:49:17] Analytics-Cluster: Create HivePartitioner in Camus - https://phabricator.wikimedia.org/T92494#2534750 (Ottomata) a:Ottomata>None [19:49:31] Analytics-Cluster, Operations: Clean up permissions for privatedata files on stat1002 - they should be group readable by statistics-privatedata-users - https://phabricator.wikimedia.org/T89887#2534751 (Ottomata) a:Ottomata>None [19:49:57] Analytics-EventLogging, Operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#2534753 (Ottomata) a:Ottomata>None [19:50:31] Analytics-EventLogging, Operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1132416 (Ottomata) Open>declined Setting up eventlogging in codfw doesn't seem very useful, since we won't be setting up the analytics cluster there. Declining this for now. [19:50:52] Analytics, Analytics-Cluster, Operations, Traffic: Enable Kafka native TLS in 0.9 and secure the kafka traffic with it - https://phabricator.wikimedia.org/T121561#2534758 (Ottomata) a:Ottomata>None [19:51:10] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2534760 (Danielsberger) I've run the 1h data set through R and there's a brief summary below, if anyone is interested. My main finding is that a) the data is consistent and will prove ver... [19:52:54] Analytics-Kanban, Patch-For-Review: Change or upgrade eventlogging kafka client used for consumption - https://phabricator.wikimedia.org/T133779#2534763 (Ottomata) p:Triage>Normal [19:53:03] Analytics-Kanban, EventBus, Wikimedia-Stream: Public Event Streams - https://phabricator.wikimedia.org/T130651#2534764 (Ottomata) p:Triage>Normal [20:14:22] Analytics, Datasets-General-or-Unknown: UploadWizard dataset is empty, limn has no data - https://phabricator.wikimedia.org/T112851#2534831 (Nemo_bis) That dashboard seems to contain totally different stats; "decline" might be a more appropriate description. [20:33:45] milimetric, mforns : did you agree on a js formatter? 
if not we could use the same one used on aqs [20:33:53] mforns, milimetric : for dashiki [20:34:17] nuria_, we did not talk about it, I'm fine with the one in aqs [20:34:29] mforns: k, we can talk about it tomorrow [20:34:33] ok [20:34:57] nuria_, we could use one also for scala and the edit history [20:35:11] mforns: ya, but there i woudl use intelij [20:35:20] aha [20:35:32] mforns: and modify default if needed (unlikely) [20:35:40] mforns: are you guys using another IDE? [20:36:00] joseph uses intelij, I don't use any [20:36:08] dan uses vim [20:36:37] mforns: I also use intelij so i vote for that being the standard [20:36:44] mforns: we can talk about it tomorrow too [20:36:58] yes, that's what I was thinking :] [20:41:36] nuria_: I don't see a reason to declare a standard IDE, but if you'd like a standard format we can use an external program. For js, the easiest to configure is eslint [20:41:41] and I'm sure we can find one for scala [20:42:00] (btw, irccloud didn't ping me, I'm gonna take a look at this matrix thing that Yuvi was praising) [20:42:24] milimetric: w/o agreeimg on a format is hard to do code changes cause you end up changing formatting w/o intending to as everyone has a different format [20:42:48] nuria_: I'm not opposed to standard formatting, just a standard IDE, those are separate things [20:43:33] and nuria_ it's exactly because IDEs change formatting automatically that I hate using them. A code editor should never do that [20:43:48] milimetric: sure, the idea is to share formatting , IDE not needed of course [20:44:23] milimetric: for js it will be great not to have two formatters , aqs alredy uses jshint i think [20:44:25] yeah, eslint is stand-alone and I'm sure it's supported in IDEA and other tools like sublime [20:44:29] nuria_, agree with milimetric [20:44:58] jshint is cool, but I had some problems with it on dashiki. I'm ok with switching back to that, I think it was hard to configure exceptions, but we can figure it out [20:46:44] milimetric: ya, i think aqs uses jshint at running test time: https://github.com/wikimedia/analytics-aqs/blob/master/.jshintrc [20:46:49] cc mforns [20:47:53] yeah, but we have it in our own repository, no reason to stick to that if we don't want [20:48:28] milimetric: dashiki also uses jshint as part of gulp [20:48:58] mforns, milimetric : https://github.com/wikimedia/analytics-dashiki/blob/master/gulpfile.js#L20 [20:49:05] Analytics-Kanban: Compile a request data set for caching research and tuning - https://phabricator.wikimedia.org/T128132#2534934 (BBlack) >>! In T128132#2534760, @Danielsberger wrote: > 3.5% of requests had zero response size, which, I assume, are aborted requests. That seems like a reasonable number to me.... [20:49:21] so easiest is probably using jshint? cc milimetric , mforns [20:49:25] aha [20:50:04] (PS1) Nuria: [WIP] Limit date range on datepicker [analytics/dashiki] - https://gerrit.wikimedia.org/r/303693 (https://phabricator.wikimedia.org/T141165) [20:50:30] let's use jshint [20:50:52] I'm still a little skeptical about auto-formatting, but I'm ok with jshint as a standard [20:51:21] (it misses a few things that I value eslint for catching, but I can always use both) [20:51:53] (so that's the thing - I can make code that passes both but would be auto-formatted different by an automatic formatter) [20:55:45] mforns, milimetric : ok, jshint it is. I * think* that i have configured that as the default in sublime but need to double check. [20:56:26] ok