[00:55:23] Analytics, Analytics-EventLogging, DBA, Research-and-Data: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2596549 (Jdforrester-WMF) Please don't kill the scripts for good, but it's OK to disable them for a bit. [01:42:41] Analytics, Pageviews-API, Wikipedia-iOS-App-Backlog, iOS-app-v5.2.0-Honey: filter suspicious TV channels pageviews from Top Read - https://phabricator.wikimedia.org/T144333#2596593 (JMinor) [01:43:46] Analytics, Pageviews-API, Wikipedia-iOS-App-Backlog, iOS-app-v5.2.0-Honey: filter suspicious TV channels pageviews from Top Read - https://phabricator.wikimedia.org/T144333#2596606 (JMinor) p:Triage>High [01:52:09] Quarry: Forking your own query results in a new one owned by YuviPanda - https://phabricator.wikimedia.org/T144309#2596611 (Huji) Now it works fine. Could you please submit the patch that fixed it here before you close the task? [01:52:18] Analytics, Pageviews-API, Wikipedia-iOS-App-Backlog, iOS-app-v5.2.0-Honey: filter suspicious TV channels pageviews from Top Read - https://phabricator.wikimedia.org/T144333#2596613 (JMinor) For previous discussion of the Top 25 exclusions and our band-aid solution in the iOS client see https://ph... [02:25:45] Analytics, Operations, LDAP: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596648 (Peachey88) [02:37:04] Analytics, Operations: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596656 (Tbayer) [02:37:57] Analytics, Operations: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2596371 (Tbayer) @Peachey88 : As explained in the task description, this issue is specifically *not* about LDAP. [05:22:49] Quarry: Forking your own query results in a new one owned by YuviPanda - https://phabricator.wikimedia.org/T144309#2596738 (yuvipanda) I, uh, just restarted redis and flushed out the db :| when bringing it back up after an outage earlier I had started the wrong redis instance... I really should move Quarry... [06:57:49] jdlrobson: Hi ! [06:58:44] jdlrobson: You have many requests running in parallel on the cluster [06:59:12] jdlrobson: While they might finish at some point, resource oversharing is not helping them [06:59:48] jdlrobson: Since each of them is quite big, best practice would be to run them sequentially [07:00:36] there's an icinga alert for the filled-up root partition on stat1001 [07:00:45] Hi moritzm [07:00:48] from jdlrobson ? [07:01:45] the directory /var/www/limn-public-data is 13G big and not under /srv (as aggregate-datasets and public-datasets) [07:02:02] moritzm: ok [07:02:05] (of the 27G root partition) [07:02:19] not sure how recent that is, though [07:02:46] moritzm: I'll take a look, and see if there is anything I can help with [07:04:09] ok moritzm, I know what that is [07:04:27] moritzm: give me a minute triple check communication has not been made [07:04:31] I dropped the apt cache and an unused kernel, but we're still at 0 (the kernels does a bit of overcommitment) [07:04:37] joal: ok [07:05:18] moritzm: nuria has synced a new folder yesterday (or this morning): /var/www/limn-public-data/caching [07:05:25] This represents most of the 13g [07:05:40] ack, that's it [07:05:41] moritzm: This data is available on HDFS, can you please delete the folder [07:05:57] moritzm: I'll communicate with the team [07:06:17] sure, I can also only drop one of the files, though? 
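For anyone following the cleanup, a minimal sketch of the triage steps being discussed, assuming the caching data really is mirrored in HDFS; the HDFS path is purely illustrative, since the real location is never stated in this log:

    # find what is eating the 27G root partition on stat1001
    sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
    du -sh /var/www/limn-public-data/caching

    # confirm a copy exists in HDFS before removing it locally
    # (hypothetical path, for illustration only)
    hdfs dfs -du -s -h /wmf/data/archive/caching

    # then reclaim the space
    sudo rm -rf /var/www/limn-public-data/caching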
[07:06:23] it's three files of 4 GB each [07:06:34] moritzm: let's delete all [07:06:39] ok [07:06:47] moritzm: this issue means the approach taken by nuria doesn't work :) [07:06:52] We should find another place [07:07:03] there's plenty of space under /srv [07:07:14] I need to drop for a minute - will be back shortly [07:07:15] 3T remaining [07:07:18] ok [07:07:25] moritzm: 3T remaining ???? [07:07:33] oh yues didn't notice the /srv [07:07:38] great [07:07:57] Thanks moritzm for the heads up, I don't receive cinga alerts for this machine [07:08:12] it's not alerting, it's just a passive check [07:08:32] hopefully we'll be able to provide access to the icinga UI at some point [07:08:57] so that you could have a look at all stat* hosts, e.g. [07:09:11] but the current setup is very unflexible, we're looking into a replacement [07:40:14] o/ [07:41:06] thanks moritzm! [07:42:11] wow weird oozie emails joal, but at least no data errors [07:42:25] for some reason it's now alerting again, but this time for inode usage: DISK CRITICAL - free space: / 50 MB (0% inode=93%): [07:42:52] going to take over and clean it up [07:43:58] mmmm it is back to 100% disk space used [07:45:17] ahhh 13GB in /home [07:45:47] oh my [07:45:52] the winner is... /home [07:45:54] no, the files in /var/www/linmn-public-data/caching are back [07:46:14] moritzm: yes but there is also a duplicate of the home dirs [07:46:20] so 13GB + 13GB [07:46:24] what the heck, I removed these [07:46:42] maybe it is one of the crons syncing data [07:46:45] from other stats [07:46:58] I am almost sure that Nuria uploaded the files somewhere else [07:47:04] and they are rsyncing [07:49:24] so we have something as weird as elukey@stat1001:/home/home/home$ [07:49:40] with ori's home (~3GB) repeated multiple times [07:50:00] and I suspect that I am the one to blame, because I re-imaged months ago the machine [07:50:09] or somebody else messed up with homes [07:50:14] but most probably it is me [07:52:00] !log removed /home/home/home dir from stat1001 to free space [07:52:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [07:52:45] now /home/home is weird [07:53:05] ls -lt shows dirs owned by users up to 2nd of May (IIRC when I reimaged) [07:53:16] but also other weird dirs dated Aug 14th? [07:54:09] ah seems old users [07:54:49] !log removed /home/home dir from stat1001 to free space [07:54:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [07:55:22] goood now free space down to 70% [07:55:24] all good [07:57:48] I think it also makes sense to out linmn-public-data out of the /var partition towards /srv (which has 3T of free disk space) [07:58:02] the other two dirs in /var/www are already symlinked [07:58:05] to /srv [08:02:05] yes it does, I'll have a chat with mr ottomata today [08:20:19] I'm back [08:20:25] Thanks elukey and moritzm ! [08:23:26] moritzm: Email sent to the team [08:24:22] 10:24 PROBLEM - Disk space on stat1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=93%) [08:24:25] ahahah [08:24:26] checking [08:24:58] elukey: I assume the rsync is managed by puppet having a cron ... 
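A sketch of the relocation being proposed here, i.e. moving limn-public-data onto the 3T /srv partition and leaving a symlink behind, the same pattern already used for aggregate-datasets and public-datasets; the exact target path is an assumption:

    # move the data off the small root partition
    sudo mkdir -p /srv/limn-public-data
    sudo rsync -a /var/www/limn-public-data/ /srv/limn-public-data/

    # swap the original directory for a symlink, like the other /var/www dirs
    sudo rm -rf /var/www/limn-public-data
    sudo ln -s /srv/limn-public-data /var/www/limn-public-data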
[08:25:41] ah yes but it is not 22GB [08:25:48] I mean the /var/www [08:25:49] mmmm [08:26:02] I thought they were less [08:26:03] sigh [08:31:59] elukey: weird oozie messages are because load jobs are taking longer than expected [08:32:28] elukey: This is due to cluster a bit overwhelmed by user queries (see my comments earlier to jdlrobson) [08:33:07] ahhh okok makes sense [08:33:09] poor oozie [08:33:14] yeah [08:33:38] Well I mean, not really elukey - With what it makes us endure, I think it's a normal return :-P [08:33:40] but at least it seems that the consistency errors are gone [08:33:43] :D [08:33:46] Yay [08:33:50] ahhaah [08:41:53] all right so stat1003 /srv/reportupdater/output$ dir contains the new caching dir that Nuria created [08:42:00] that gets rsynced to limn-publicdata [08:48:49] all right so to unblock the situation for the moment I'll copy the caching dir to my home [08:48:59] and then I'll delete it on stat1003 [08:49:08] even if I am not sure how it gets published in there [08:49:09] elukey: hm [08:49:09] mmmm [08:49:23] elukey: problem is on stat1001, not 3 [08:49:48] joal: I know :) [08:49:55] elukey: I think the rsync should go to a /srv folder, then we can manually (or puppet) a symlink from /var/www [08:49:56] the dir on stat1003 gets rsynced on stat1001 [08:50:33] yes but I want to restore functionality and then fix the issue with some calm [08:50:49] so [08:50:50] elukey: I don't get it then [08:51:08] what is your doubt? [08:51:14] I don't uderstan [08:51:21] copying the data in your homen [08:51:37] I am not sure if nuria has it somewhere [08:51:51] elukey: I'm pretty sure it's oin hdfs [08:52:10] elukey: But, best would be to comment the puppet cron for the moment [08:52:31] elukey: no communication has been made on that data eing available yet [08:53:51] well we could just move the dir somewhere else [08:53:57] outside the scope of the rsync [08:54:05] then it will be automatically deleted on stat1001 [08:54:11] might be better [08:54:13] elukey: as you wish [09:10:22] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2597638 (Qgil) Loading the page takes a while, but ok. It is about delays after all. ;) There are metric about MERGED and ABANDONED changesets there. I would expect to see only me... [09:10:36] !log Moved stat1003:/srv/reportupdater/output/caching to /home/elukey/caching as temporary measure to free space on stat1001 [09:10:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [09:11:15] Analytics-Tech-community-metrics: Deployment of Mediawiki panels - https://phabricator.wikimedia.org/T138006#2597639 (Qgil) "No results found". Is this expected? [09:11:45] Analytics-Tech-community-metrics: Deployment of Demography panel - https://phabricator.wikimedia.org/T138757#2597640 (Lcanasdiaz) >>! In T138757#2597618, @Qgil wrote: > Is this about the "[[ https://wikimedia.biterg.io/app/kibana#/dashboard/Git-Demographics?_g=(refreshInterval:(display:Off,pause:!f,value:0),... [09:12:44] mforns: you there? 
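The stopgap recorded in the !log above amounts to moving the oversized directory out of the tree that reportupdater rsyncs to stat1001 (the rsync command itself is quoted just below); roughly:

    # on stat1003: park the data where the hourly rsync no longer sees it
    mv /srv/reportupdater/output/caching /home/elukey/caching

    # on stat1001: the existing copy still has to be removed by hand,
    # since this particular rsync runs without --delete (see the later !log)
    sudo rm -rf /var/www/limn-public-data/caching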
[09:13:45] in two minutes the /usr/bin/rsync -rt /srv/reportupdater/output/* stat1001.eqiad.wmnet::www/limn-public-data/ sync will run and the space will be freed [09:13:48] BUT [09:13:59] I have no idea how the caching data got there [09:14:12] since there are other crons of report updater to create stuff [09:14:26] Analytics-Tech-community-metrics: Deployment of Mediawiki panels - https://phabricator.wikimedia.org/T138006#2597641 (Lcanasdiaz) >>! In T138006#2597639, @Qgil wrote: > "No results found". Is this expected? No :-/ . Working on it .. [09:15:15] but afaiu nuria copied it manually [09:15:58] Analytics-Tech-community-metrics: Deployment of Gerrit Delays panel for engineering - https://phabricator.wikimedia.org/T138752#2597643 (Lcanasdiaz) >>! In T138752#2597638, @Qgil wrote: > Loading the page takes a while, but ok. It is about delays after all. ;) > > There are metric about MERGED and ABANDONE... [09:18:09] !log deleted /var/www/limn-public-data/caching on stat1001 to free space [09:18:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [09:21:14] 11:19 RECOVERY - Disk space on stat1001 is OK: DISK OK [09:27:18] joal: completed the clean up work for port 7000 between Cassandra and hadoop FYI [09:27:30] I am running puppet on aqs100[123] not [09:27:32] *now [09:27:42] if you see anything weird let me know [09:32:21] Analytics-Tech-community-metrics: Deployment of Mediawiki panels - https://phabricator.wikimedia.org/T138006#2597670 (Aklapper) "No results found" should get fixed in the next days; data gathering still in progress afaik. [09:39:06] Analytics, Beta-Cluster-Infrastructure, Services, scap, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2597678 (elukey) >>! In T116206#2595478, @bd808 wrote: >>>! In T116206#2582429, @elukey wrote: >> Thanks for reporting, this is my bad since analytics_hadoop_hosts... [09:50:37] also ready to rolling restart aqs100[123] to check performances joal [09:50:57] k elukey [09:57:14] elukey: question: It seems I don't have access to druid100[123] machines - Is that normal? [09:58:40] if andrew wants to mess with you yes :P [09:58:49] huhuhu :D [09:59:03] kidding, probably he didn't add the admin group in puppet [09:59:05] let me check [10:08:21] yes.. going to figure out where it is best to put the admins data [10:08:46] there is a hieradata/eqiad/druid.yaml but not sure if hieradata/role/druid would be better [10:10:24] elukey: I can't say ... I don't think we're gonna have druid clusters in other DCs [10:12:12] yeah but most of the hiera config for admins is in role so I am going to stick with the convention.. checking the druid role now [10:20:55] joal: you should be able to access now [10:21:26] I added analytics-admins/roots to the druid hosts [10:21:50] elukey: Awesome, testing [10:22:06] elukey: Working :) [10:22:11] elukey: Thanks a lot mate ! [10:23:32] now I am wondering if this should have been an access request [10:23:33] mmmm [10:25:10] ah joal I'd need to revert the change [10:25:13] for two reasons [10:25:23] 1) analytics-roots gives full sudo access to analytics [10:25:37] 2) analytics-admins gives sudo for oozie/hive/etc.. [10:25:43] that are not there :) [10:25:52] elukey: generates failures ? [10:25:55] so the best thing would be to create a druid-admins users [10:26:04] with correct sudo permissions [10:26:06] k [10:26:08] and go through access [10:26:09] ok? 
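Whatever group ends up on the druid hosts, a quick way to sanity-check what it actually grants once puppet has run; a sketch, using joal's shell account purely as an example:

    # is the user present and in the expected group on this box?
    id joal
    getent group analytics-admins

    # which sudo rules, if any, does that membership translate to?
    sudo -l -U joal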
[10:26:17] I'll help you if you need anything with access [10:26:22] elukey: I don't need druid sudo, just needed to access the host [11:38:07] joal: I restarted only aqs1001 till now since I say compactions ongoing, but look what happened to latency [11:38:57] elukey: hehe [11:43:07] elukey: what is grafana dashboard showing an instance system metrics? [11:43:13] elukey: I can recall it :( [11:43:25] elukey: must be the tenth time I ask... sorry [11:44:07] server-board? [11:44:28] YESSSSS ! [11:44:31] Thanks :) [11:44:35] :) [11:46:41] all right cluster restarted [11:46:44] all good [11:46:52] I am going to lunch and then I'll double check metrics [11:46:52] great, thanks elukey [11:46:55] k [11:47:16] this one is very interesting https://grafana.wikimedia.org/dashboard/db/aqs-cassandra-system?panelId=7&fullscreen [11:47:53] elukey: just got an idea, we'll discuss that before standup [11:49:09] sure [12:07:47] taking a break a-team, see you in a bit [12:09:06] hallo [12:09:46] I'm trying to run a query of webrequest using beeline on stat1002, and it doesn't seem to do anything after [12:09:53] Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1472219073448_14336 [12:10:31] usually it types map reduce status lines, and takes up to minutes, but now it has been stuck for a much longer time [12:16:52] aharoni: hello! I think that the cluster is a bit overloaded atm, this might explain the problem [12:17:20] elukey: OK, I'll wait patiently [12:17:22] thanks [12:18:31] thank you! :) [12:41:43] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2597981 (elukey) During the last analytics ops sync we decided to test a rolling restart of cassandra on aqs100[123] to double check if last month's performance improvements (latency going dow... [12:49:46] Analytics-Kanban: AQS Cassandra READ timeouts caused an increase of 503s - https://phabricator.wikimedia.org/T143873#2597991 (elukey) Read latency followed the same pattern: {F4419667} [13:00:20] ok IOPs for the raid arrays drop right after the cassandra restart [13:00:34] causing a drop in read latency and also in response time [13:24:42] urandom: hi! If you have time I'd need a cassandra consult about --^ [13:25:00] the above phab task contains also some data [13:25:27] this is really weird [13:25:38] but there are 1000 things to check :D [13:26:09] the only big trace left by Cassandra seems to be the disk IOPs [13:51:19] mforns: btw, try not casting to string in the *HistoryRunner sql queries, the sqoops should be already cast [13:51:32] if they're not, I've gotta sqoop with the latest code maybe [13:51:45] (which I should do anyway so we have fresher data when Erik looks) [13:58:12] milimetric: o/ [14:00:56] elu hiii [14:01:01] why no analytics-admins/roots on druid hosts? [14:01:03] i would've done that too [14:02:21] hiiiiiii! [14:02:43] so I started to have tons of doubts [14:02:54] 1) analytics-roots gives full sudo afaiu [14:03:16] 2) analiytcs-admins gives sudo for stuff not running on druid IIRC (oozie, hive, etc..) [14:03:22] Analytics, Analytics-EventLogging, DBA, Research-and-Data: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2598171 (DarTar) I paused all the cronjobs for the ee-dashboards and will help @HJiang-WMF and @Milimetric cherry-pick those that need t... [14:03:26] and 3) do we need an access request? 
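On the rolling restart elukey describes (one aqs node at a time, waiting out compactions first), a hedged sketch of the per-node steps, assuming the default single-instance service name on aqs100[123]:

    # wait until no compactions are running on this node
    nodetool compactionstats

    # flush memtables and stop serving cleanly, then restart the service
    nodetool drain
    sudo systemctl restart cassandra

    # confirm the node is back Up/Normal (UN) before moving to the next host
    nodetool status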
[14:03:36] so I reverted waiting for you :) [14:03:45] Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2598172 (DarTar) [14:04:43] Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2598178 (jcrespo) Can I kill already-running jobs? [14:05:27] ottomata: do we need to change puppet to rsync correctly for limn-public-data ? [14:05:43] elukey: don't think so, there is a symlink [14:05:55] elukey: so, analytics-roots gives full sudo [14:05:57] that is its intention [14:06:09] analytics-admins does give sudo perms for stuff that isn't ther,e that's true [14:06:15] so maybe druid-admins makes sense [14:06:26] perhaps analytics-admins should be called hadoop-admins, dunno [14:06:26] but [14:06:30] i think that's a little annoying [14:06:44] yeah :( [14:06:47] generally we want the folks in analytics-admins to be able to do admin stuff on all analytics cluster boxes [14:07:08] puppet doesn't have the ability to not grant the sudo perms if they don't make sense, but they also don't hurt to have on the box [14:07:08] well not full sudo [14:07:15] no? [14:07:30] (I trust you guys just asking, don't look me in a bad way :P) [14:07:33] analytics-roots is sudo [14:07:34] right? [14:07:43] full sudo [14:07:46] yeah it gives root perms basically [14:07:58] yeah [14:08:08] haha, oh but nobody is in that group :p [14:08:16] ahahhaha [14:08:40] elukey: i'd add analytics-admins to that node, joseph just needs shell access there [14:08:50] and it doesn't actually give any permissions yet [14:09:03] or hm [14:09:08] maybe druid-admins would be better [14:09:11] elukey: ahh the thing is [14:09:26] i think we are still a little waffly about whether or not druid will be here for the long run. I feel about 90% about it, which is pretty good [14:09:44] and i guess we can always change group names later, buuuut, its kinda nice to just have a catch all analytics-admins that is useful [14:09:47] it will be the same people [14:13:18] ottomata, elukey: In fairness, I don't mind having hive sudo in places where there is no hive (but I understand if you tell me it's bad nonetheless) [14:13:37] elukey: As said before leaving, I have an idea for cassandra restart improvements [14:13:49] me too! [14:13:50] :D [14:14:16] anyhow, I think that the best procedure would be to create either a "real" analytics-admins or just a druid-admins [14:14:19] elukey: All In ! I go first ;) [14:14:24] and then submit sudo request to ops [14:14:42] BUT I am always the picky one so don't pay attention to me :P [14:15:57] elukey: there is a real analtyics-admins [14:15:59] you mean real analytics-roots [14:15:59] ? [14:18:51] ottomata: didn't we say that analytics-admins is in reality hadoop-admins? [14:19:11] oh [14:19:16] i mean, right now it is i guess ja [14:19:22] sudo capability wise I meant sorry :) [14:19:40] yes, but i would probably just expand its meaning [14:19:43] i'm fine with either [14:19:51] but it seems convenient and fine to me to have it mean more than just hadoop-admins [14:20:24] yep agreed.. a lot clearer [14:20:31] we could make a little refactor [14:20:57] wait, haha, i'm suggesting we leave it as analytics-admins [14:21:02] and just expand what it can do [14:21:27] so, i think you should just add analytics-admins to druid nodes for now. since folks don't actually need any special druid perms atm. 
[14:21:33] they just need access to the boxes [14:24:35] that would require a phab task :P [14:26:11] re: analytics-admins - does it make sense to have only one? because having one sudoers per "cluster" is handy since you can limit a lot what you can do [14:26:25] I'd refactor analytics-admins to hadoop-admins [14:26:30] and create druid-admins [14:26:40] finally removing analytics-roots [14:26:54] it is painful but more granular imho [14:29:14] yeah it is, but it seems unlikely that we would need that, and will just be more annoying to maintain [14:29:20] but elukey, ja i agree that that is good too [14:29:24] so if you prefer that i'm cool with that [14:29:27] not strongly opinioned here [14:30:56] maybe we can ask this to the team [14:31:00] and check their opinion [14:35:24] aye [14:37:00] milimetric, cool, I will add a patch to the history runners to remove the casts [14:39:39] elukey: hi [14:40:14] o/ [14:44:40] (CR) Joal: "If the reconstruction is scala only, it so far uses files paths, which mean we could go without hive tables (I don't mind having them thou" [analytics/refinery] - https://gerrit.wikimedia.org/r/306292 (https://phabricator.wikimedia.org/T141476) (owner: Milimetric) [14:44:54] elukey: yt? [14:45:22] nuria_: saw your message from yesterday about spark failure [14:45:30] nuria_: have you managed to have ti working? [14:45:46] joal: no, but i just run the query in hive [14:45:53] joal: was about to check results [14:46:04] nuria_: That's weird [14:46:31] nuria_: The errors you posted yesterday about workers being lost are fake errors (expected behavior) [14:47:27] joal: ah, ok, but still manipulating data on spark was talking forever [14:48:28] nuria_: Supposedly it's faster than in hive, that's why I use it for this kind of analysis, but nevermind, I assume it also depends the level of familiarity with the tool [14:48:39] joal: maybe i was doing something wrong but basically i just selected from the table created and called take(10) [14:48:49] joal: and that was minutes and minutes [14:48:53] hm [14:49:06] nuria_: first run is expensive: need to extract a month of pageviews [14:49:37] nuria_: then, if you cache the temp table (not done in my script, my bad), next queries should be really faster [14:50:04] joal: teh cached temp table persists across spark-shell restarts? [14:50:25] nuria_: o/ [14:50:30] also nuria_, cluster is stalled for users from yesterday, I'm waiting for jdlrobson to come online to discuss with him [14:50:43] elukey: hola! let's talk about caching directory in standup, i get there are some space issues? [14:50:49] nuria_: nope, if you want the table to persist, you need to save the data and then read from there [14:50:55] joal: you can kill jdlrobson query [14:50:57] ottomata already fixed the issue [14:51:04] I'll restore the data now [14:51:10] I backupped it in my home dir [14:51:16] nuria_: there are like 20 of them I think, that's actually the issue [14:51:31] joal: he is learning hive so was mentioning that was going through a lot of data and was trying to figure out how to make his pass samller [14:52:00] hm . I don't like to kill people queries nuria_. 
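For context on what killing a query would mean here: the Hive queries run as YARN applications, so they can be inspected and, if really necessary, killed at that level. A sketch, with the user filter and application id shown only as examples (the id mirrors the job id quoted earlier in the log):

    # list running applications and pick out one user's jobs
    yarn application -list -appStates RUNNING | grep jdlrobson

    # kill a single application by id
    yarn application -kill application_1472219073448_14336

As joal says, the better fix is for the user to run the day-by-day queries sequentially, so each one gets a full share of the cluster rather than starving everything else.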
[14:52:17] joal: I talked with him about these just yesterday though [14:52:48] joal: about how he needed to reduce his data size but those were his 1st hive queries so he wasn't familiar with partitions and such [14:53:30] nuria_: partitionning was not optimal, but most problematic thing is launching a lot queries in parallel [14:53:39] jdlrobson: for when you come online --^ [14:54:20] joal: can we sandbox those even more so they do not affect other cluster business? I remember we reduced resources for user space [14:54:58] nuria_: Each query reads one day of webrequest data, and having them in parallel means they all have a very small amount of resources because they share it, meaning, at some point, it'll finish, but in the meantime there is no available resource for other regular users [14:55:25] nuria_: No impact on prod business, impact on other users only (like amir for instance) [14:55:48] joal: i see, and we cannot further reduce resources by user? [14:56:14] nuria_: resource quota management is really overkill for our use cases (difficult to set up and maintain) [14:56:30] I think teaching our users is the best scenario :) [14:57:47] nuria_: Big data tools tend to abstract computation cost to users - We should make sure they understand the resource cost of what they are doing [15:00:48] elukey: standduppp [15:27:58] elukey: can we move the data into /aggregate-datasets? that way i can delete it from 1002 [15:28:27] nuria_: sure I am looking into that, but I thought you put them on stat1003 [15:28:52] or are you saying that you have that also in stat1002? [15:29:03] (sorry too many rsyncs running between stat* :P) [15:31:02] basically what I know is [15:31:03] 15 * * * * /usr/bin/rsync -rt /srv/reportupdater/output/* stat1001.eqiad.wmnet::www/limn-public-data/ [15:31:03] elukey: data comes from 1002 and i rsycn-ed to 1003 so i t woudl made it to 1001 http endpoint [15:31:15] *would [15:31:18] ahhahaha [15:31:22] double jump [15:31:24] didn't know that [15:31:49] elukey: cause data comes from hive [15:31:58] elukey: but 1003 does not have access to hive [15:32:17] elukey: and i cannot put data directly in neither 1001 or 1002 [15:35:20] okok I need to figure out how to put data on aggregate-datasets [15:35:26] reading puppet [15:35:51] ? [15:35:54] there is an rsync module [15:35:58] ::srv [15:36:11] rsync ... stat1001.eqiad.wmnet::srv/aggregate-datasets/ [15:36:37] ahh so brutally copied in there [15:36:46] yup [15:36:49] I admit that the stat relationships confused me [15:36:51] there might be a cron too [15:36:56] that auto copies [15:37:01] yeah man, me too [15:37:26] cron { 'rsync aggregate datasets from stat1002': [15:37:27] command => "/usr/bin/rsync -rt --delete stat1002.eqiad.wmnet::srv/aggregate-datasets/* ${working_path}/aggregate-datasets/", [15:37:32] this guy in here [15:40:53] but I can't find srv/aggregate-datasets/ on stat1002 [15:42:07] ottomata: I get the stat1001 direct rsync but not the one --^ [15:42:16] that runs in cron on stat1001 [15:44:55] Arf elukey, we forgot to sync on logistics for the conf [15:47:03] joal: ah about druid perms? [15:47:09] sigh you are right [15:47:23] elukey: if we don't we'll never go ;) [15:47:55] I can re-join and steal 5 minutes of your meeting [15:48:00] nuria_: executing rsync -rt /srv/reportupdater/output/caching stat1001.eqiad.wmnet::srv/aggregate-datasets/ [15:48:04] (on stat1003 [15:48:19] elukey: yeah.... stat1002 /a vs /srv is the worst.' 
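Putting the thread together, the publishing chain that caused today's confusion looks roughly like this; the first two commands are the cron lines quoted verbatim above, and the --delete flag is the reason data parked only on stat1001 gets purged:

    # stat1003 -> stat1001, hourly, no --delete
    /usr/bin/rsync -rt /srv/reportupdater/output/* stat1001.eqiad.wmnet::www/limn-public-data/

    # stat1001 pulls aggregate-datasets from stat1002 WITH --delete, so anything
    # present only on stat1001 under that tree disappears on the next run
    /usr/bin/rsync -rt --delete stat1002.eqiad.wmnet::srv/aggregate-datasets/* ${working_path}/aggregate-datasets/

    # and on stat1002 the ::srv rsync module points at /a rather than /srv,
    # per /etc/rsyncd.conf:
    #   [ srv ]
    #   path = /a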
[15:48:22] i would like to get rid of /a [15:48:26] but, historically it exists [15:48:37] all other stat servers use /srv as the big data partition [15:48:40] but stat1002 uses /a [15:48:44] nooooooooo [15:48:57] :/ [15:48:57] on the other stat boxes, /a is a symlink to /srv [15:49:00] actually stat1003 not sure, checking. [15:49:14] yeah [15:49:18] but stat1002 both /a and /srv exist [15:49:38] and, to make things more transparent [15:49:41] on stat1002 [15:49:48] the ::srv rsync module [15:49:49] points at /a [15:49:57] in /etc/rsyncd.conf [15:49:58] [ srv ] [15:49:58] path = /a [15:50:23] thanks joal for remembering about the queries I run :) [15:50:34] ottomata: do I need to copy the caching stuff to stat1002.eqiad.wmnet::srv/aggregate-datasets/ to avoid the --delete? [15:50:50] I mean, I am copying from stat1003 to stat1001 [15:51:07] but with the --delete from stat1002 to stat1001 [15:51:14] it'll get purged no? [15:51:48] ahhh --delete [15:51:49] you are right. [15:51:55] maybe public-datasets instead then? :p [15:52:20] ahhahahaah [15:52:30] elukey: this is why we wanted to clean all this up [15:53:34] ok elukeyyeah [15:53:53] if you want the caching dataset source to be stat1003, then you should put it in public-datasets [15:53:55] on stat1003 [15:53:57] and then rsync [15:54:02] that will be the easiest thing to do i guess [15:55:38] ottomata: maybe /srv/public-datasets/analytics/caching on stat1003? [16:12:39] ok now I am running /usr/bin/rsync -rt --delete stat1003.eqiad.wmnet::srv/public-datasets/* /srv/public-datasets/ on stat1001 [16:13:50] nuria_: https://datasets.wikimedia.org/public-datasets/analytics/caching/ is going to be populated in the next 10 minutes :) [16:14:47] elukey: on meeting , will look in a bit [16:26:05] all right done! [16:35:07] all right people I am logging off for a bit, will read the chan later on! [16:35:15] bye elukey :) [16:35:27] joal: sent a meeting invite to discuss the ApacheCon [16:35:38] Great elukey, thanks ! [16:35:44] elukey: back [16:35:57] elukey: all right! [16:36:03] super :) [16:36:07] elukey: will delete data from 1002 and ping those people [16:37:27] * elukey nods [16:37:45] let me know if there is anything left to do later on :) [16:38:14] elukey: i think we are done, will eave ticket open until they can take a look and update e-mail thread [16:40:53] Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2598801 (DarTar) >>! In T144278#2598178, @jcrespo wrote: > Can I kill already-running jobs? You can definitely kill this one. Please do not kill all queries b... [16:41:47] Hey milimetric, do we go for a 1h session? [16:46:30] omw cave joal [16:46:44] milimetric: joining as well [17:09:31] Analytics, Analytics-EventLogging, DBA: Queries on PageContentSaveComplete are starting to pileup - https://phabricator.wikimedia.org/T144278#2598900 (jcrespo) Open>Resolved >>! In T144278#2598801, @DarTar wrote: >>>! In T144278#2598178, @jcrespo wrote: >> Can I kill already-running jobs? >... [17:26:06] a-team: browser dashborads 2000+ uniques visits, ok, now, that is significant traffic [17:26:23] great news nuria_ :) [17:26:36] nuria_, 2000 per day? month? [17:27:19] mforns: per day [17:27:32] O.o [17:27:35] mforns: ya, [17:28:15] we will see if traffic gets higher from what it used to be after the tweets/blogpost [17:30:07] mforns: we used to have 200/500 uniques per week so that is a significant increase [17:30:42] nuria_, yes! 
the tweets will for sure have sth to do with that [17:44:00] logging off a-team [17:44:06] Bye ! [17:44:07] bye joal! see you [17:45:15] dr0ptp4kt: question: the trending service being developed ...what is it using as data sources? [17:51:10] laters! [17:57:57] dr0ptp4kt: hola? [18:02:32] hi nuria_ . i haven't dug into the code, but as i understand it is using pageviews restbase endpoint and rcstream (although mobrovac et al have suggest changeprop to jdlrobson). jdlrobson, can you speak further to what that thing is currently made of? it'll be a restbase endpoint [18:03:21] nuria_: Looks llike you've meesedup the backfilling file gain ;) [18:03:55] joal: me NEVER [18:04:21] joal: look now [18:04:21] nuria_: Ok, I stop being so accusative then :) [18:04:44] thx again nuria_ :) [18:04:49] joal: fixed, i was debating whether loading a 3rd month while is compacting the other two [18:05:05] nuria_: I have no opinion [18:06:05] * joal disapeear for real [18:08:04] nuria my service uses rcstream and pageviews apis. It was developed for a pet project of mine > https://trending.wmflabs.org [18:33:17] jdlrobson: ah, yes. I remember. [18:33:45] mostly to play with push notifications and how they might look [19:35:49] (PS6) Nuria: Bookmark for browser dashboard regarding graph and time [analytics/dashiki] - https://gerrit.wikimedia.org/r/306980 (https://phabricator.wikimedia.org/T143689) [19:36:59] (CR) Nuria: Bookmark for browser dashboard regarding graph and time (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/306980 (https://phabricator.wikimedia.org/T143689) (owner: Nuria) [19:38:01] mforns: if you have time to CR the dashiki patch for bookmarks (doesn't have to be now) i can deploy it later together with milimetric 's fix [19:38:17] nuria_, sure, I'll try today [19:40:23] mforns: no rush [19:41:12] ottomata: FYI, i added https://phabricator.wikimedia.org/T125854 (organize directories in http://datasets) to ops goals [19:45:26] great [19:48:46] I have a question about hadoop [19:51:31] bawolff: please [19:51:53] Does hadoop store which cookies users have during a request [19:52:36] bawolff: no [19:52:50] bawolff: only some cookies of interest are stored. see: [19:52:50] :( [19:53:24] Specificly, I'm trying to answer the question of "What are the most common cookies users have, included cookies from user js" [19:54:48] bawolff: we do not store that type of data (even short term) as we do not needed to count pageviews or unique devices [19:54:53] bawolff: https://wikitech.wikimedia.org/wiki/X-Analytics [19:55:02] bawolff: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest [19:55:12] bawolff: these are the request datasets [19:55:50] bawolff: some cookies "of interest" like WMF-Last-Access are published via x-analytics [19:55:55] bawolff: but not all [19:56:20] bawolff: and also by looking whether a user is logged in you can infer that they have "logged -in" cookies [19:56:52] urandom: yt? (and avialble for YET another cassandravquestion!) [19:56:57] urandom: yt? (and avialble for YET another cassandra question!) 
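To make the "cookies of interest" point concrete: the WMF-Last-Access signal nuria mentions is exposed through the x_analytics map in wmf.webrequest, so its presence can be counted without ever storing raw cookies. A sketch, treating the exact query (partition values included) as an assumption based on the wikitech docs linked above:

    hive -e "
      SELECT COUNT(*) AS requests,
             SUM(IF(x_analytics_map['WMF-Last-Access'] IS NOT NULL, 1, 0)) AS with_last_access
      FROM wmf.webrequest
      WHERE webrequest_source = 'text'
        AND year = 2016 AND month = 9 AND day = 1 AND hour = 0;
    "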
[19:57:05] *available , argh [19:57:28] nuria_: i can try [19:57:54] bawolff: so, in short, i doubt you can answer your cookie question in an easy way, that is [19:58:00] urandom: I am trying to execute [19:58:07] nuria_: ok [19:58:10] https://www.irccloud.com/pastebin/xAiuI1my/ [19:58:44] urandom: w/o success, but other commands like: [19:58:45] nodetool-a tablestats -- local_group_default_T_pageviews_per_article_flat.data [19:58:46] nuria_: yeah, 4-6 are multi-instance nodes [19:58:51] yeah [19:59:04] nodetool-a is for running against instance 'a' [19:59:37] urandom: ohhhhh [19:59:42] urandom: super useful, thanks [19:59:43] nuria_: so you want: nodetool-{a,b} {compactionstats,compactionhistory} [19:59:59] nuria_: I think I'm going to suggest making a MW extension that generates a log entry when an unrecognized cookie is encountered [20:02:14] urandom: ok, now i need three days to read all the data that thing spits out [20:02:16] urandom: wow [20:02:27] bawolff: a log entry where? [20:02:45] nuria_: heh [20:02:56] logstash [20:02:57] nuria_: btw, i have tools that will return both of those things as json [20:03:06] if that were of help... [20:03:27] urandom: no worries, iam ninja on cmd line [20:03:42] nuria_: compation history output is unparseable [20:03:47] it gets all strung together [20:04:00] bawolff: you might know best but logstash doesn't seem to be looked at much (i might be wrong) [20:04:04] but the history of limited use anyway [20:04:39] urandom: this seems useful: [20:04:42] https://www.irccloud.com/pastebin/vWiZwtIf/ [20:04:45] nuria_: I think this is wanted for a one-time thing [20:05:05] Background context is: Legal wants to know all the cookies that are set by popular user gadgets [20:05:07] nuria_: yeah [20:05:18] nuria_: add -H to see data in human-readable format [20:05:42] and of course: watch -d -- [20:06:10] but mostly I'm thinking logstash, because its really easy to log to from mediawiki [20:07:09] bawolff: if that is the case you could tap into the stream that varnish generates perhaps and look at it for a week , the same stream than varnishkafka reads to send us data [20:07:30] bawolff: this might be not so easy if you cannot get hold of a cache host [20:07:56] hmm, interesting [20:08:15] That goes a bit outside of the bubble of things I know how to do [20:12:12] bawolff: ok, you know best, this is the logging format varnishkafka send us: https://wikitech.wikimedia.org/wiki/Cache_log_format#Varnishkafka_Format [20:12:37] bawolff: by looking at puppet some things are sent to statsv and maybe you can sent what you are interested on [20:12:48] * bawolff will look [20:14:04] urandom: the stats from compactionstats and our graphs in graphana do not seem they have much in common [20:14:11] urandom: https://grafana-admin.wikimedia.org/dashboard/db/aqs-cassandra-compaction [20:14:34] urandom: given this I would expect a bunch of compaction is pending but according to stats is mostly done [20:14:41] nuria_: Thank you for your help :) [20:14:46] bawolff: np [20:14:59] nuria_: pending compactions ought to line up [20:15:16] nuria_: but the rest is apples-oranges [20:15:37] urandom: it says remaining time "23 min" [20:16:06] (CR) Mforns: "Do we want to make Dashiki react accordingly to clicks on the browser's back and forward buttons? If not, this patch LGTM in general. 
See " (2 comments) [analytics/dashiki] - https://gerrit.wikimedia.org/r/306980 (https://phabricator.wikimedia.org/T143689) (owner: Nuria) [20:16:08] that's probably to be taken with a grain of salt [20:16:23] urandom: ah ok [20:16:40] i think it's based on the throttled throughput (as opposed to rate measured), and the remaining [20:16:48] nuria_, ^ LGTM but shouldn't Dashiki react to clicks on browsers back and forward buttons? [20:18:02] mforns: i agree, but that could also be a separate patch [20:18:13] mforns: That would be something to add to vital signs too as we do not modify history, api is probably supported now though let me see [20:18:14] (because it's not reacting right now) [20:18:16] milimetric, yes I'm totally OK with that [20:18:34] just askin' [20:18:46] milimetric: ya, on vital signs we did not do it on purpose as history api was not everywhere yet, but that was 2 years ago! [20:19:08] aha [20:19:43] mforns: ya, indeed, history api is everywhere [20:19:44] yeah, I remember, we could always do it conditionally [20:20:05] ah good, then yeah, that should be fairly simple to add [20:20:07] milimetric: yes, i argued against it cause i did not want to use a polyfill for history [20:20:22] milimetric: but that argument is of no concern now [20:21:30] nuria_, anyway the patch lgtm, there's these small comments on the duplicate definitions of the regexps, but otherwise I would merge it [20:22:46] mforns: ok, let me see those thank you! [20:23:01] np! [20:25:09] halfak: https://pystitch.github.io/ in case you haven't heard of it. Kind of duplicates what you do with jupyter but maybe you prefer this style sometimes [20:28:02] Ahh yes. Seems like there was hole in python land from RMarkdown
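One more note on that "remaining time" figure: as urandom says, it is probably based on the throttled throughput rather than the measured rate, so it is easiest to just watch progress directly. A sketch using the multi-instance wrappers discussed above (plain nodetool applies on single-instance hosts):

    # human-readable compaction progress for instance 'a', refreshed every 2s
    watch -d -- nodetool-a compactionstats -H

    # the configured throttle that the remaining-time estimate is derived from
    nodetool-a getcompactionthroughput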