[00:12:55] Analytics, RESTBase, Services, HyperSwitch: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2010298 (mobrovac) NEW [01:41:16] Analytics-Kanban, Editing-Analysis, Patch-For-Review, WMF-deploy-2016-02-09_(1.27.0-wmf.13): Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#2010513 (Jdforrester-WMF) [01:41:41] Analytics-Kanban, Editing-Analysis, Patch-For-Review, WMF-deploy-2016-02-09_(1.27.0-wmf.13): Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] - https://phabricator.wikimedia.org/T124676#1962252 (Jdforrester-WMF) This is now good to go, from 2016-... [07:11:34] (CR) KartikMistry: [C: 2] "TODO: Scripts are OK. Output of these script need to go to limn-language-data graphs." [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/268974 (https://phabricator.wikimedia.org/T108158) (owner: Amire80) [07:11:58] (Merged) jenkins-bot: Adding statistics scripts [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/268974 (https://phabricator.wikimedia.org/T108158) (owner: Amire80) [08:35:45] o/ [09:13:13] Hi elukey :) [10:30:24] the kafka1012 situation is still *bad* [10:30:25] /dev/sdb3 1.8T 1.7T 96G 95% /var/spool/kafka/b [10:30:33] /dev/sdf1 1.8T 1.7T 148G 92% /var/spool/kafka/f [10:30:35] mwarf ! [10:30:46] elukey: have talked with ottomata yesterday about that ? [10:31:43] briefly, it seems that the logrotation happens every 7 days but we started only a couple of days ago getting a huge amount of data to catch up [10:32:01] I don't get it elucker [10:32:02] so until 7 days are expired, the log keeps growing [10:32:25] kafka1012 started a few days ago, and therefore have a lot of data to catch up, right [10:32:45] but why would it use ***much*** more disk space than other brokers ? [10:34:10] this is the same thing that I was wondering [10:34:48] we agreed to wait for today to see the growth before deciding to truncate [10:35:06] ok [10:35:31] elukey: the core main space is used by kafka data, right (not logs of errors or something else) ? [10:36:33] seems only upload and text, as yesterday [10:36:51] yes, other partitions are very small in comparison [10:40:24] joal: I am going to do some work for ops in the next two hours, then I'll restart working on this [10:40:27] :) [10:40:31] k [10:40:32] :) [12:04:15] joal: I took a look to the docs and I re-thought about what Andrew said [12:05:41] the directory with a huge partition (say webrequest_upload-10) contains logs dated Feb 5, that is when we restarted the broker. That data might have been past messages that kafka1012 needed to fetch as part of its recovery. [12:06:00] the problem might be that those files were created on other brokers with an earlier timestamp [12:06:41] so could have already been deleted because their week retention time has passsed [12:06:56] meanwhile the ones on kafka 1012 will be deleted on the 12th [12:07:24] elukey: deletion policy is based on message time, right ? 
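For context, a quick way to check what the broker is actually configured to retain and how old the on-disk segments are (a sketch only; the partition directory under /var/spool/kafka/b is an assumption, the config path is the one mentioned later in the log):

    # broker-wide retention settings
    grep -E '^log\.retention\.(hours|bytes)' /etc/kafka/server.properties
    # disk usage on the kafka data partitions
    df -h /var/spool/kafka/b /var/spool/kafka/f
    # segment files for one partition; deletion is keyed off the file
    # modification time, so these dates are what the cleaner compares against
    ls -lht /var/spool/kafka/b/webrequest_upload-10 | tail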
[12:07:37] yep, file timestamp [12:09:09] at recovery time, on feb 5th, I assume kafka downloaded a lot of data from other brokers (catching up a few days) [12:09:37] creating all logs with a more recent timestamp [12:09:42] and keeping them for 7 days [12:09:43] But I wouldn't expect it to tag all of it for feb 5th [12:10:17] well it just created those files, and the unix timestamp will be the one used by the "cleaners" [12:10:19] That was my original question: does kafka delete based on message timestamp, or data reception timestamp [12:10:34] file timestamp of the log on the file system [12:10:48] -rw-r--r-- 1 kafka kafka 512M Feb 5 16:30 00000000065290382906.log [12:10:51] this one for example [12:10:58] + 7 days [12:12:02] and it makes sense since it considers all the logs as byte streams, so they should be opaque to it [12:12:12] and the only timestamp is the file system one [12:12:15] right ok, makes sense [12:12:39] at least, this is my poor understanding after chatting with the Kafka master ottomata :D [12:12:45] I would have expected kafka to be a bit smarter about getting back after failure :) [12:13:00] But it makes sense [12:13:06] yep me too! We should consider this failure scenario when setting up downtime [12:14:00] So I guess you'll truncate old files with ottomata when he arrives, and we could possibly discuss augmenting the number of partitions for big topics to mitigate that issue [12:14:33] Meaning: worst-case scenario: a broker has to have enough disk space for 14 days of data [12:15:08] If spreading webrequest upload and text across more disks, we could handle that without issue I think [12:15:12] elukey: --^ [12:15:22] 60 Feb 9 [12:15:23] 148 Feb 8 [12:15:23] 140 Feb 7 [12:15:23] 920 Feb 6 [12:15:23] 243 Feb 5 [12:15:35] this is upload-10 atm [12:15:44] number of logs + timestamp [12:16:06] grand total of ~800GB [12:17:46] joal: not sure what is the best path forward, you and Andrew have more experience than me on this so probably we'd need a discussion :) For the immediate future, Feb 5 and 6 might be good candidates for a truncation but I'd need to figure out how to delete things safely first [12:17:58] even though that data is replicated [12:18:31] but yeah Feb 6 is definitely the issue :) [12:19:05] elukey: http://stackoverflow.com/questions/16284399/purge-kafka-queue [12:19:24] elukey: It's not the exact thing we want, but we could take this idea as an example [12:19:42] elukey: basically letting kafka purge the big stuff by itself [12:20:32] could be an option, there are tons of logs probably not needed anymore [12:21:06] also elukey, IIRC, we could set an upper limit to kafka data size per topic (in addition to the time one) [12:21:29] we might be willing to consider that as well [12:22:47] yep yep [12:25:34] going out for lunch, brb! [12:45:01] back [13:12:07] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011245 (He7d3r) [13:23:19] Analytics-Tech-community-metrics: Korma: Incorrect ticket count on top contributors page - https://phabricator.wikimedia.org/T126328#2011283 (He7d3r) NEW [13:26:01] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011293 (Aklapper) Thanks for reporting this! (and playing with korma!) I'll merge those accounts! :) This task might be to some extend its own problem ("Synta...
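A one-liner along these lines (a sketch, same assumed partition path as above) reproduces the per-day file counts pasted at 12:15:

    # count .log segments per modification date for one partition
    ls -l --time-style=+%F /var/spool/kafka/b/webrequest_upload-10/*.log \
        | awk '{print $6}' | sort | uniq -c
    # and the rough total size of the partition directory
    du -sh /var/spool/kafka/b/webrequest_upload-10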
[13:30:10] Analytics-Tech-community-metrics, Developer-Relations, DevRel-February-2016, developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2011296 (Aklapper) a:Aklapper>Dicortazar >>! In T103292#198... [13:33:16] Analytics-Tech-community-metrics, JavaScript: Syntax error, unrecognized expression on Korma profiles - https://phabricator.wikimedia.org/T126325#2011300 (He7d3r) [13:45:48] joal: 08:41 PROBLEM - Disk space on kafka1012 is CRITICAL: DISK CRITICAL - free space: /var/spool/kafka/b 73705 MB (3% inode=99%): /var/spool/kafka/f 127290 MB (6% inode=99%) [13:54:53] ottomata: https://dpaste.de/1MVs <--- I broke down the upload/text partition logs by date for upload/text [13:56:16] o/ joal [13:56:26] will be running ~5 min late [13:56:33] See you soon :) [14:00:30] Hi halfak :) [14:00:37] np, will be there [14:03:44] so we could modify log.retention.hours=168 temporary to 4 days to remove the 5th and gain a bit of space [14:06:06] that'd be it elukey :) [14:07:05] it might be enough to do service kafka stop; edit /etc/kafka/server.properties; service kafka start [14:10:28] elukey: in a meeting, but I don't even think we need to restart kafka [14:11:17] if we contact zookeper probaly not, but if we want to modify server.properties yes [14:54:26] wikimedia/mediawiki-extensions-EventLogging#530 (wmf/1.27.0-wmf.13 - 1de03a1 : root): The build has errored. [14:54:26] Change view : https://github.com/wikimedia/mediawiki-extensions-EventLogging/commit/1de03a1d6b6f [14:54:26] Build details : https://travis-ci.org/wikimedia/mediawiki-extensions-EventLogging/builds/108041097 [15:16:38] milimetric: heyyYyy [15:16:42] should we do our thang? [15:17:05] ok, lemme get the build [15:17:15] (CR) Ottomata: [C: 1] "+1, but I don't have much context, so probably someone else should take a look." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/268639 (https://phabricator.wikimedia.org/T125960) (owner: Joal) [15:17:18] k [15:17:31] ottomata: we should take some time for kafka with elukey [15:17:38] Ho, and HI ! :) [15:17:43] oh for dat disk [15:17:43] ja [15:17:45] hallloooo! [15:18:32] haaaalllooo [15:19:35] ottomata: I took a look to the logs layout on the disk, thanks for the hint yesterday :) [15:19:55] oh? [15:20:19] I finally got the ctime/mtime thing [15:21:24] ah ok great [15:21:42] also https://dpaste.de/1MVs is interesting if you haven't seen it [15:22:17] ah cool, yeah makes sense [15:22:21] bout the same for sdb i assume [15:22:25] its pretty full too [15:23:28] elukey: what if on this broker we should temporarily set log.retention.hours=24*2 [15:23:30] and restarted the boker [15:23:31] broker [15:23:49] i thikn it would delete those files from feb 5 and 6 [15:23:52] and free up lots of space [15:24:42] there are per topic settings that we can set on the fly [15:24:54] but, i think that would change the retention policy for all brokers [15:25:22] i also think that it might be fine to stop the broker and manually delete old files [15:25:28] but i would prefer to let kafka do it [15:26:19] ok ottomata, all good: https://gerrit.wikimedia.org/r/#/c/269420/ [15:26:32] deploying can happen, I'm on the three servers and will test locally and remotely [15:26:50] oook, so aqs deploy, will do check then deploy on all 3, ja? [15:26:56] yes [15:27:04] and we'll have to see if the local changes are cleanly wiped [15:27:39] what file was that again? 
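On the per-topic settings that can be changed on the fly (the idea behind the stackoverflow link at 12:19): a hedged sketch of what that would look like, assuming a Kafka release where kafka-topics.sh still accepts per-topic config overrides (newer versions moved this to kafka-configs.sh); ZOOKEEPER_HOST is a placeholder:

    # temporarily drop retention for one heavy topic so the broker purges it itself
    kafka-topics.sh --zookeeper ZOOKEEPER_HOST:2181 --alter \
        --topic webrequest_upload --config retention.ms=86400000
    # once the cleaner has deleted the old segments, remove the override so the
    # topic falls back to the broker default (the exact flag name for deleting a
    # topic config override varies by Kafka version)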
[15:28:13] projects/aqs_default.yaml [15:28:30] but the new code has the same change [15:28:53] k [15:28:57] check looks good [15:28:59] proceeding with deploy [15:29:41] ahhh [15:29:42] msg: Failed to init/update submodules: error: Your local changes to the following files would be overwritten by checkout: [15:29:42] projects/aqs_default.yaml [15:29:45] cool [15:29:47] that's pretty cool [15:29:56] i will revert locally on each [15:30:28] ok deploying again [15:30:31] cool [15:30:36] yeah, it's good that it stops us [15:30:40] ya for sure [15:31:18] ok all 3 look good [15:31:26] k, testing [15:32:05] ottomata: yes I wanted to modify /etc/kafka/server.properties with 24*4 days initially to remove only Feb 5th as test [15:32:12] with service kafka restart [15:32:17] aye [15:32:29] elukey: if you want to proceed, i'll back you up [15:32:29] all right, sending a code review and then proceeding [15:32:35] ah, naw [15:32:40] don't bother with commit on this one [15:32:45] we'll stop puppet [15:32:48] nono unrelated CR [15:32:49] :) [15:32:49] make the change, restart kafka [15:32:52] ohoho [15:32:52] ok [15:32:58] hm, this doesn't exist: https://en.wikipedia.org/wiki/Superb_Owl [15:33:06] haha [15:33:10] that is a shame! [15:33:12] i kno [15:33:14] what an opportunity! [15:33:18] i kno! [15:33:22] ok, all's good tho [15:33:25] tests all pass [15:33:30] ok great [15:33:42] oddly the local tests are taking longer than usual now, but just a bit longer, probably fine [15:35:46] fail :/ https://en.wikipedia.org/wiki/Wikipedia:Article_wizard/Neologism [15:37:11] omg, I'm out, no wonder we don't have editors: https://en.wikipedia.org/w/index.php?title=Superb_owl&action=history [15:38:16] "No such subspecies so redirecting to the order" !!!! [15:40:10] :) [15:41:14] !log kafka broker restarted - kafka1012 [15:49:01] ottomata: shall we try with three days? [15:49:15] hmmm [15:49:28] it didn't really clear up any space, did it? [15:50:17] nope from what I can see, but 4 days gets us to Feb 5th and maybe timestamp are still "valid" [15:50:30] ja [15:50:31] looking at that [15:50:44] with 3 days we'd be sure [15:50:57] elukey: i think you are right [15:51:03] an old one i see is [15:51:03] 512M Feb 5 16:42 00000000048959000755.log [15:51:11] current time is 15:50 [15:51:26] joal, I've edited https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly/Sanitization. Can you check that I did OK? [15:51:31] elukey: let's do 2 days, most of the data is in feb 6 [15:51:32] I also posted some notes on the talk page. [15:51:48] ottomata: all right! moving the retention to 48 hours [15:51:51] k [15:51:56] And filed a task to add to wikitech -- https://phabricator.wikimedia.org/T126338 [15:53:26] ottomata: restarted! [15:53:41] !log restarted kafka1012 with 48hrs of log retention [15:53:54] lots of deletions looks like! :) [15:54:48] yesssssssss [15:55:05] oh yaa there it goes [15:55:30] seeing recovery with df -h [15:55:49] /dev/sdf1 1.8T 1021G 813G 56% /var/spool/kafka/f [15:56:43] ja [15:57:19] re-enabling puppet [15:57:37] joal: https://etherpad.wikimedia.org/p/analytics-notes [15:57:45] and setting back retention to the original value [15:57:46] at the top of that I put my draft for the email I want to send on Thursday [15:58:02] (the reason for sending Thursday is because I need it to make the opening joke work) [15:58:08] ok elukey deleltions look good [15:58:08] ja [15:58:09] continue [15:58:10] JDD - Joke driven development [16:00:13] * joal likes JDD ! 
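For the Kafka admin notes, a condensed sketch of the procedure that was just run on kafka1012 (commands reconstructed from the conversation above, not copied from the host):

    # keep puppet from reverting the manual edit while we work
    puppet agent --disable 'temporary log retention change on kafka1012'
    # in /etc/kafka/server.properties, temporarily lower the broker-wide value:
    #   log.retention.hours=168  ->  log.retention.hours=48
    service kafka restart
    # watch the old segments being deleted and the space coming back
    df -h /var/spool/kafka/b /var/spool/kafka/f
    # afterwards: restore log.retention.hours=168, restart kafka once more,
    # then re-enable puppet
    puppet agent --enable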
[16:00:18] milimetric: +1 :) [16:00:24] Maybe some next steps ? [16:00:26] ottomata: all done, kafka restarted with 168 hrs retention limit [16:00:43] nice [16:00:45] elukey, ottomata : Thanks a million guys for having handled that kafka thing :) [16:01:11] joal: I know what the technical next steps are, but I wanted to hold off to see if anyone disagreed. I'll mention that I'll send next steps if everyone agrees [16:01:25] joal: I am going to write a note in the Kafka admin page about how to purge logs! [16:01:32] awesome milimetric :) [16:02:23] halfak: Thanks for the edits and comments [16:02:36] No problem. Sorry I didn't get farther. [16:02:54] * halfak loves the "please read this", "Oh! I'll edit while I read" pattern. [16:09:17] joal: reading more flink blog stuff [16:09:19] it looks really awesome [16:09:40] ottomata: yeah, I agree :) [16:12:25] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Purge_broker_logs [16:13:24] nice gonna add something [16:15:45] elukey: in your investigations did you notice if it was ctime or mtime? [16:23:38] ottomata: can you send a link with the flink articles you read ? [16:25:10] ottomata: I think that both are the same [16:25:19] sorry, both have the same value [16:25:32] because the log file doesn't get touched [16:25:36] no? [16:26:25] ja it is appended to [16:26:29] until it is rotated [16:26:40] ja joal just read this [16:26:41] http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/ [16:26:43] savepoins [16:26:47] savepoints [16:26:52] in 1.0, which is not released yet [16:26:54] right: stateful streaming ! [16:27:05] sounds great, heh :) [16:30:56] halfak: how should I answer your comments on the talk page (I actually have never done that, can you believe it ) [16:31:04] Analytics, Analytics-EventLogging, EventBus: Make eventlogging-service work in MW Vagrant - https://phabricator.wikimedia.org/T126346#2011724 (Ottomata) NEW a:Ottomata [16:31:12] joal, sure! [16:31:33] halfak: I modify the source, adding a portion for instance ? [16:58:00] ottomata: whenever you have time can you check that https://gerrit.wikimedia.org/r/#/c/268682/14 is ok? [17:04:25] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Add autoincrement id to EventLogging MySQL tables. {oryx} [8 pts] - https://phabricator.wikimedia.org/T125135#2011821 (Milimetric) a:madhuvishy [17:12:45] Analytics-Kanban: Add pivot parameter to tabular layout graphs {crow} [? pts] - https://phabricator.wikimedia.org/T126279#2011843 (Milimetric) p:Triage>Normal [17:14:35] elukey: cool +1 on the burrow template change [17:14:42] merge away! [17:15:53] * elukey is happy [17:18:34] Analytics-Kanban: Add pivot parameter to tabular layout graphs {lama} [? pts] - https://phabricator.wikimedia.org/T126279#2011860 (Milimetric) [17:18:38] Analytics-Kanban: Improve the data format of the browser report {lama} - https://phabricator.wikimedia.org/T126282#2011861 (Milimetric) p:Triage>Normal [17:26:45] mforns: have you seen this? https://tools.wmflabs.org/musikanimal/pageviews [17:26:55] milimetric, yes! [17:27:42] :) I'm glad everyone's working on this client thing, I think people just have a lot of fun doing that apparently [17:27:47] milimetric, it is awesome [17:29:44] heh "I had originally said this was not meant to be a long-term solution, but I take that back. There are many more features coming!"
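On the ctime/mtime question: a quick way to see both timestamps on a rolled segment (a sketch; the file name is the one from the listing at 12:10, the path is assumed):

    # mtime (%y) and ctime (%z) should match once kafka has rolled the segment,
    # since nothing appends to or renames the file after that
    stat --format='mtime: %y   ctime: %z   %n' \
        /var/spool/kafka/b/webrequest_upload-10/00000000065290382906.log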
[17:30:00] boy that wraps up computer science in one quote :) [17:30:15] hehe [17:34:24] Analytics-Kanban: camus-wediawiki job should run in production (or essential?) queue {hawk} [1 pts] - https://phabricator.wikimedia.org/T125967#2011913 (JAllemandou) [17:34:34] Analytics-Kanban: Use a new approach to compute monthly top 1000 articles (brute force probably works) {slug} [8 pts] - https://phabricator.wikimedia.org/T120113#2011914 (JAllemandou) [17:35:00] ottomata: currently doing that: https://phabricator.wikimedia.org/T125967 [17:35:17] I suggest to do it jointly for mediawiki and eventlogging - Your opinion ? [17:35:20] ottomata: --^ [17:36:40] hmm, joal i'm not sure [17:36:48] maybe prod, prob not essential [17:37:04] ottomata: ? [17:37:26] Concerned about resource prehemption? [17:38:31] a bit, i am not sure if is 'essential' :) [17:39:03] ottomata: For MediaWiki, pretty sure: Anlytics-Search has prod jobs (of their own), but prod nonetheless :) [17:39:14] For eventlogging: agreed :) [17:39:31] joal: i am not too opinionated about it [17:39:35] so whatever you think is fine [17:40:08] I prefer to have it in essential, since any prod job depends on it [17:40:15] I'll do that for MW only :) [17:43:42] CR posted ottomata :) [17:44:34] k cool [17:44:57] Analytics, Research consulting, Research-and-Data: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2011940 (leila) Update: Erik and I met today and discussed this in more details. We will continue working on this card, and update this card with poi... [17:49:07] Analytics, Analytics-EventLogging, EventBus: Make eventlogging-service work in MW Vagrant - https://phabricator.wikimedia.org/T126346#2011946 (mobrovac) [17:54:32] Analytics, Research consulting, Research-and-Data: Update official Wikimedia press kit with accurate numbers - https://phabricator.wikimedia.org/T117221#2011984 (leila) [17:59:08] a-team: staff is on batcave [17:59:22] nuria: yeah I updated the link [18:00:05] ottomata: coming to stafff??? [18:24:09] *scratches head* hmm, ottomata, just looking at making this draft puppet module now, I guess I should make a user and have my cron run under that user (as with in the wdbuilder module) [18:24:22] where should this system users home directory be? [18:28:49] addshore: hm not sure [18:29:04] /var/lib/wmde-analytics ? xD [18:29:39] I guess putting it in home wouldn't be the worst thing, I doubt there is going to be a real user called wmde-analytics... [18:30:42] ja that is probably fine. addshore is this user maybe similar to the analytics-search user we added for discovery recently? [18:31:00] will this user be used to own files and do things in hadoop? [18:31:04] oooh probably, *searches in puppet for that* [18:31:15] ottomata: yeh, own files and run a cron [18:31:20] in hdfs? [18:31:35] hmm, no it wont actually own anything in hdfs at this stage [18:31:57] addshore: is this tied to stat1002/3 type box usage? [18:32:10] yup, most of it is tied to stat1002 [18:32:15] or coudl it run anywhere that has access to research dbs? [18:32:22] why stat1002? logs in /a? [18:32:36] some of it just needs the research dbs actually, othe bits need the logs and other bits need hive [18:32:51] ah it will run hive jobs? [18:32:52] this user? [18:33:04] yus! (at least 1) [18:33:29] ok, then yeah, i think you can/should model after analytics-search user. 
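As a plain-shell illustration of the system user being discussed here (a sketch only; the real module would use a puppet user resource modeled on analytics-search, and the wmde-analytics name and /var/lib home are just the values floated above, not decided):

    # hypothetical equivalent of what the puppet module would manage
    useradd --system --create-home --home-dir /var/lib/wmde-analytics \
        --shell /bin/false --comment 'WMDE analytics cron user' wmde-analytics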
[18:33:41] you might want to parameterize your new puppet module to accept a user to use for stuff [18:33:48] and pass it in, but not manage the creation of it in that module [18:33:53] okay! :) [18:42:15] a-team, except if there is anything for which I'm needed, I'm gonna get some diner :) [18:43:33] going to log off too, byyyyeeeeee o/ [18:44:33] Analytics: Migrate the simplest limn dashboards to dashiki tabular {frog} - https://phabricator.wikimedia.org/T126358#2012091 (Milimetric) NEW [18:44:59] nuria: ^ listed out the different dashboards and what to do with each [18:45:28] * milimetric lunching [18:52:07] ottomata: one last thing! hhvm has been upgraded on all mw* servers, and kafka1012 is fine [18:52:20] so I am planning to start the reboot of the brokers for the kernel updates [18:52:27] (tomorrow morning) [18:52:44] I'll ping Marcel and Joseph beforehand [18:52:51] restarting one kafka node at the time [18:53:14] elukey, cool [18:53:25] this time I also have access to the console so I'll be able to check and fix if the host doesn't boot [18:53:44] ok! [18:53:46] elukey: sounds good [18:53:46] danke! [18:53:57] remember that 1012 is unbalanced [18:54:03] haven't run an election since [18:54:12] * ottomata lunching [18:55:22] all right I'll run the election first [19:21:07] ottomata: I added you to a really rough draft at https://gerrit.wikimedia.org/r/#/c/269467/, would be great to get some random comments on it from you! [19:25:35] milimetric: nice, will file ticket for EL IP removal now [19:30:18] Analytics: Develop Verify Merge scripts into Data Warehouse repo {mole} - https://phabricator.wikimedia.org/T88641#2012263 (Nuria) Open>declined [19:33:24] nuria: I tested the autoincrement change on beta and it mostly looks fine - except there are gaps in the autoincrement ids - they are in increasing order but i see big gaps - have you seen this behavior before? [19:34:29] addshore: commented [19:35:28] madhuvishy: gaps in autoincrement....mmm.. no [19:35:46] nuria: ah hmmm okay let me see if i can figure out why [19:36:36] madhuvishy: check engine on tables, should be tokudb [19:36:47] madhuvishy: let me give you select [19:37:06] nuria: okay - it is beta cluster so i assume it is - but sure [19:37:24] madhuvishy: SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)`, ENGINE, CREATE_OPTIONS FROM information_schema.tables WHERE TABLE_SCHEMA='log' /* AND `ENGINE` <> 'TokuDB' */ ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC LIMIT 30; [19:37:38] madhuvishy: we had to do some hand work to put tokudb there [19:38:10] madhuvishy: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/TestingOnBetaCluster#Database [19:38:34] nuria: it is TokuDB - not all the tables - but the one I'm looking at is [19:38:39] madhuvishy: k [19:38:43] madhuvishy: that makes sense [19:38:58] because it was newly created [19:39:38] should puppetize that ^^ [19:39:42] or at least edit the my.cnf and add it [19:39:47] --default-storage-engine=tokudb [19:39:52] shouldn't have to remember to pass that on start up [19:40:42] ottomata: but the thing is that i do not even think the mysql install on that node is on puppet - or rather I do not know how to find it- [19:41:40] heh, dunno either [19:41:46] nuria: if not, then just edit my.cnf manually [19:42:28] ottomata: on mysql restarts right? 
i just edited it [19:42:38] ja [19:42:39] danke [19:42:45] ottomata: but [19:42:48] https://www.irccloud.com/pastebin/k1pUQsrn/ [19:43:00] sudo madhuvishy [19:43:05] :) [19:43:05] aahhh [19:43:25] why would it not say that [19:43:26] https://xkcd.com/149/ [19:43:51] :D [19:44:12] i dont know how to verify it started with the tokudb engine though [19:44:21] Analytics, Analytics-EventLogging: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default - https://phabricator.wikimedia.org/T126366#2012321 (Nuria) NEW [19:44:38] Analytics, Analytics-EventLogging: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole} - https://phabricator.wikimedia.org/T126366#2012328 (Nuria) p:Triage>High [19:45:25] madhuvishy: let me see "show engine?" [19:47:02] madhuvishy: https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/ [19:47:13] TokuDB | DEFAULT | Tokutek TokuDB Storage Engine with Fractal Tree(tm) Technology | YES | YES | YES [19:47:34] show engines [19:48:24] milimetric: I cannot change standup time , if you are the owner can you make the change? [19:51:07] madhuvishy: looks like it is a documented bug: http://bugs.mysql.com/bug.php?id=63128 [19:51:15] nuria: read this blog - that solution doesn't seem wise for us - what do you think? We can ignore the gaps - after all we just need them to be increasing in number [19:51:25] madhuvishy: no action needed [19:51:29] we should be careful to not use them as counts [19:51:31] that is all [19:51:46] madhuvishy: cause for us the important thing is that they increment so we can use them for replication purposes [19:51:54] yuupp [19:52:32] madhuvishy: once you are sure mysql coperates with autoincrement, let's deploy new code , make sure it works and notify jaime [19:52:35] nuria: okay then - i'll test this a bit more - but i think it's good to go [19:52:44] cool [19:59:08] Analytics: Research Spike: Provide data on about top pageviews across all projects daily - https://phabricator.wikimedia.org/T126367#2012359 (Nuria) NEW [20:00:16] Analytics, MediaWiki-extensions-ContentTranslation, operations: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#2012373 (Nuria) @Amire80 closing as data is been gathered on 1002 now [20:07:44] Analytics-Kanban, Patch-For-Review: Lower parallelization on EventLogging to 1 consumer {oryx} [3 pts] - https://phabricator.wikimedia.org/T125225#2012410 (Nuria) Open>Resolved [20:08:04] Analytics-Kanban, Patch-For-Review: Modify wikimetrics local install to account for recent changes to puppet to get rid of the self-hosted puppet master {dove} [13 pts] - https://phabricator.wikimedia.org/T123749#2012415 (Nuria) Open>Resolved [20:09:03] Analytics-Kanban, RESTBase, Patch-For-Review: Update AQS config to new config format {melc} [5 pts] - https://phabricator.wikimedia.org/T122249#2012421 (Nuria) Open>Resolved [20:09:20] Analytics-Kanban: Bookmark-able graphs in Dashiki tabular layout [3 pts] {lama} - https://phabricator.wikimedia.org/T124298#2012428 (Nuria) Open>Resolved [20:09:44] Analytics-Kanban: Separate dashiki staging and prod hiera configs [1 pts] - https://phabricator.wikimedia.org/T126076#2012434 (Nuria) Open>Resolved [20:17:43] (PS1) Ppchelko: Removed RESTBase-specific code from AQS repository. 
[analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) [20:46:49] ottomata: where can I see camus logs? troubleshooting a problem, no data since Feb 3 13h-14h UTC [20:47:13] Analytics-Kanban, Patch-For-Review: Dashiki textual visualization [5 pts] {lama} - https://phabricator.wikimedia.org/T124297#2012682 (Nuria) Open>Resolved [20:47:23] ah they are on analytics1027 dcausse, [20:47:26] don't think you hae access [20:47:27] looking [20:47:45] Analytics-Kanban: Eventlogging should start with one bad kafka broker, retest that is the case {oryx} [5 pts] - https://phabricator.wikimedia.org/T125228#2012685 (Nuria) Can we add the upgrade ticket to this task? [20:47:56] 16/02/09 20:10:34 ERROR kafka.CamusJob: Error for EtlKey [topic=mediawiki_CirrusSearchRequestSet partition=3leaderId= server= service= beginOffset=794615265 offset=794615266 msgSize=300 server= checksum=860363735 time=1455045329229 message.size=300]: java.lang.RuntimeException: null record [20:47:59] I still don't know where's problem, mediawiki or camus, I'm currently fetching data from kafka to check [20:48:09] damn [20:48:24] Analytics-Kanban, Learning-and-Evaluation, Patch-For-Review: Add instruction text next to the input fields in the Program Global Metrics Report {kudu} [1 pts] - https://phabricator.wikimedia.org/T121899#2012693 (Nuria) Open>Resolved [20:48:31] dcausse: maybe i shoudl revert the snappy change? [20:48:44] Analytics-Kanban, Patch-For-Review: Reorganize oozie jobs to not use mobile cache webrequest_source {hawk} [21 pts] - https://phabricator.wikimedia.org/T122651#2012694 (Nuria) Open>Resolved [20:48:51] not sure it's related, because the problem started on Feb 3 [20:49:02] Analytics-Kanban, Analytics-Wikimetrics, Patch-For-Review: Include all timezones in global metrics report interface {kudu} [3 pts] - https://phabricator.wikimedia.org/T121167#2012696 (Nuria) Open>Resolved [20:50:56] oh ok [20:50:57] hm [20:50:57] right [20:51:02] running camus on my side to see [20:51:16] do you have ton of these errors? [20:52:09] yes, getting you yarn logs from last run... [20:52:41] ottomata, dcausse i have seen that error before and it was a red herring [20:53:09] ottomata: I think the thing to look for in the logs is: ERROR kafka.CamusJob: job failed: 100.0% messages skipped due to other, maximum allowed is 0.1% [20:53:57] nuria: ok [20:54:21] dcausse: we should check what is on topic just in case data is bad but I think that error might be missleading [20:55:00] nuria: I'm running camus to see, it's very slow so data may still be here... [20:55:22] dcausse: data should be there right? cause we keep days of it [20:55:22] I hope records are not broken :/ [20:55:48] dcausse: stat1002 /tmp/mediawiki-camus.log [20:56:08] ottomata: thanks [20:56:14] cc ottomata . Don't we keep like 7 days of buffer for data in kafka or ... i am ... making it up on my mind? [20:56:22] yes [20:56:24] that is correct [20:57:00] nuria: job failed: 100.0% messages skipped due to other, maximum allowed is 0.1% [20:57:02] :( [20:57:13] dcausse: ok, now that is a problem [20:57:47] dcausse: cause it means it exited w/o having process any data [20:58:03] yes so something wrong in the data :/ [20:58:07] dcausse: I would use kafkatee to see if things look ok ( data is there we keep 7 days) [20:58:29] nuria: ok will check [20:58:35] dcausse: and not only that , logs are TERRIBLE [20:58:52] dcausse: so painful... are you set up to run camus in 1002? 
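Since Camus runs as a YARN job, the errors quoted here live in the aggregated YARN logs; a sketch of pulling them out (the application id shown is the one ottomata references just below, at 21:00):

    # fetch the aggregated logs for one camus run and look for the failure
    yarn logs -applicationId application_1454006297742_34085 \
        | grep -E 'messages skipped|ExceptionWritable|null record' | head -n 40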
[20:58:59] yes [20:59:08] nuria: means kafkacat [20:59:24] ottomata: yesssir kafkacat [20:59:42] will maybe use kafkacat because I'm not sure I can see all errors, I mean if something wrong happens in the mappers [21:00:01] hmm, reading an event out with kafkacat and parsing it with avro-tools-1.7.7.jar fragtojson seems to work [21:00:28] dcausse: i've got you a bit more logs in /tmp/mediawiki-camus-application_1454006297742_34085.log [21:00:31] it was getting really large [21:00:49] so i had to grep out some repeated messages [21:00:54] and also ctrl-c'ed cause it was taking a while [21:01:00] ebernhardson: mmmmm... schemas? are we using the right schemas [21:01:15] 2016-02-09 19:15:28,930 INFO [main] org.wikimedia.analytics.refinery.camus.coders.AvroBinaryMessageDecoder: Underlying schema registry for topic: mediawiki_CirrusSearchRequestSet is: org.wikimedia.analytics.refinery.camus.schemaregistry.KafkaTopicSchemaRegistry@477e697c [21:01:20] nuria: i double checked the header and it reports the correct magic byte and schema number [21:01:33] 2016-02-09 19:15:28,946 WARN [main] com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputRecordWriter: ExceptionWritable key: topic=mediawiki_CirrusSearchRequestSet partition=4leaderId=18 server= service= beginOffset=794647097 offset=794647098 msgSize=231 server= checksum=3408991578 time=1455045328933 message.size=231 value: java.lang.RuntimeException: null record [21:01:33] at com.linkedin.camus.etl.kafka.mapred.EtlRecordReader.nextKeyValue(EtlRecordReader.java:295) [21:01:34] ... [21:01:36] basically i used the process i documented here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/MediaWiki_Avro_Logging#Test_reading_an_Avro_record_from_a_Kafka_topic [21:02:18] i only read a single value out though, so possibly we are sending null records at some point hmm [21:02:26] ebernhardson: you guys are such winners .. thanks for documenting [21:02:56] ebernhardson: mmmm... with "some" records that are not valid the parser might exit, that is configurable on properties [21:04:56] ebernhardson: nevermind, it is the "max exceptions to print" [21:05:24] ebernhardson: I would try to decode from the offsets: beginOffset=794647097 offset=794647098 [21:07:27] ok lemme figure out the right kafkacat arguments, i know they have some options in there for that [21:07:33] ebernhardson: do more than 1 record and let me know if you need help [21:08:54] nuria: how long does it take for a new revision's table to be created? [21:08:59] on EL [21:09:14] nuria: unfortunately my current test setup can only handle 1 record at a time :( i'll put on my list to put together a small consumer that can do more than one at a time [21:09:24] madhuvishy: depends on the batch size and timing , let me see cause mforns changed those settings [21:09:44] madhuvishy: also depends on replication, at the master it should be visible as soon as it is created [21:10:08] nuria: is there a master for beta cluster?
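A small complement to the header check ebernhardson mentions: dump the first bytes of one message to eyeball the framing (a sketch; the 9-byte header, one magic byte plus an 8-byte schema id, is inferred from the dd skip=9 in the command dcausse pastes just below at 21:17, and the broker address is the one used there):

    # show the framing bytes of a single message at a known offset
    kafkacat -b 10.64.5.13:9092 -t mediawiki_CirrusSearchRequestSet \
        -o 794647098 -c 1 | head -c 9 | xxd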
[21:10:21] madhuvishy: ah no, there it should take just couple mins [21:11:28] nuria: hmm it is taking longer - yesterday i inserted events to the Analytics schema - there was no table for it in the db - there were no errors - I could see the events validated in kafka - i left it for a while and came back and saw that the table existed [21:11:30] madhuvishy: see changes, should take 300 secs [21:11:32] https://github.com/wikimedia/eventlogging/commit/3c61014e8e9dab8839f3c52141d71923b41229b6 [21:11:41] right 5 minutes [21:12:05] i changed the Analytics schema today and inserted events to the new revision [21:12:32] its been a while definitely more than 5 minutes - but no table yet [21:12:35] ebernhardson: you can look at /home/nuria/avro-kafka/README/launch_camus_job_no_wrapper.sh [21:12:42] events are validated, are in files, kafka etc [21:12:45] ebernhardson: if you want to "mimic" 1 camus run [21:12:54] ebernhardson: you need properties file and build jars [21:13:12] madhuvishy: in beta labs? [21:13:15] yes [21:13:26] yo don't need to build jars [21:13:26] madhuvishy: do you see your events in all-events.log? [21:13:30] you can use the jars on stat1002 [21:13:30] yup [21:13:36] but you do need a custom properties file [21:13:40] ottomata: ah true, cause you do not have new changes [21:13:42] dcausse: is doing it [21:14:53] madhuvishy: man .....what else? let me log in into beta [21:15:17] nuria: sure - only thing was i restarted with that tokudb engine default in my.cnf [21:15:31] madhuvishy: before you sent events right? [21:15:37] nuria: yes yes [21:17:00] ebernhardson: kafkacat -t mediawiki_CirrusSearchRequestSet -o 794647098 -c 1 -b 10.64.5.13:9092 | dd iflag=skip_bytes skip=9 | java -jar ~ebernhardson/avro-tools-1.8.0.jar fragtojson --schema-file ~ebernhardson/CirrusSearchRequestSet_111448028943.avsc - [21:17:30] I can see the record but I have an EOF maybe it's normal [21:18:01] dcausse: the EOF is normal, i'm not sure what avro-tools expects at the end but it was always there [21:18:08] ok [21:18:20] madhuvishy: so this one has records as of recent: [21:18:24] https://www.irccloud.com/pastebin/7A2lndW0/ [21:18:29] will check with another record... [21:18:46] nuria: yes - but this is not the latest revision [21:19:00] i was seeing if insertion still worked to the old one [21:19:15] nuria: Revision 15332607 is latest [21:19:17] madhuvishy: ok, so i guess insertion is working [21:19:22] yes [21:19:39] nuria: kafkacat -b deployment-kafka02eployment-prep.eqiad.wmflabs:9092 -t eventlogging_Analytics -o -10 [21:19:39] data seems sane :/ [21:19:49] nuria: you can see the new revision events [21:20:22] can we see mysql insertion logs somewhere? [21:21:18] dcausse: your camus run works fine? [21:21:35] your test run? [21:21:37] ottomata: well it was running until now :/ [21:21:45] ? [21:21:47] Error: java.io.IOException: Failing write. Tried pipeline recovery 5 times without success. [21:22:06] but still running... [21:22:10] hm [21:23:14] o/ joal. Still around? [21:23:26] madhuvishy: so mysql log has no"15332607" [21:23:33] nuria: yess [21:23:33] madhuvishy: I would add logging to table creation [21:23:42] nuria: ok trying [21:23:45] madhuvishy: to see whether something is off [21:23:56] madhuvishy: remember to reinstall/stop/start [21:24:01] yup [21:24:07] madhuvishy: k let me know [21:24:49] nuria: where do these logs go? 
[21:25:07] ottomata: I have data [21:25:31] madhuvishy: to root@deployment-eventlogging03:/var/log/upstart# more eventlogging_consumer-mysql-m4-master.log | grep 15332607 [21:25:38] nuria: ah ok [21:25:40] madhuvishy: upstart manages them [21:27:04] bye a-team! cya tomorrow [21:27:15] good night, Marcel [21:27:23] :] [21:27:24] ottomata: data I have: hdfs dfs -ls /user/dcausse/camus/raw/mediawiki/mediawiki_CirrusSearchRequestSet/hourly/2016/02/03/14 [21:27:25] dcausse: ? [21:27:32] but it is not in prod [21:27:33] that is from an import you just did? [21:27:36] with camus [21:27:38] yes [21:27:39] with snappy setting? [21:27:51] yes [21:28:24] interesting [21:28:28] I've run camus with my own jar... [21:28:32] dcausse: are you using camus jar in /srv/deployment/analytics/refinery? [21:28:33] ah [21:28:34] yeah [21:28:44] we did a deploy feb 3 [21:28:47] of refinery [21:28:59] which has a new camus jar [21:29:04] hmmMMMmM [21:29:04] but [21:29:08] well [21:29:09] not a new jar [21:29:11] just a new version [21:29:14] shouldn't be any changes, not sure though [21:29:15] hm [21:29:21] let me check [21:29:21] looking [21:29:32] hmm [21:29:42] camus jar is the same [21:29:46] refinery-camus is a new version [21:30:43] dcausse: can you paste the command you are running to launch your camus? [21:31:03] /srv/deployment/analytics/refinery/bin/camus --run --job-name camus-avro-test-dcausse-2 -l ./refinery-camus-0.0.23-SNAPSHOT.jar mediawiki.properties --check &> camus-run-snappy.log [21:31:32] so it's an old jar [21:31:40] ja, but clearly something broke with the new one. [21:31:45] even though i don't see any changes in logs [21:32:01] heheh, i think the cron job should link against a version. lemme just change it [21:32:23] just like oozie, we shouldn't change the jars out from under the thing we know works, even if we build a new version [21:33:37] milimetric: this is how you get cache headers on api requests: https://meta.wikimedia.org/w/api.php?action=sitematrix&formatversion=2&format=json&maxage=20&smaxage=20 [21:33:43] dcausse: going to launch a prod camus run with that refinery-camus 0.23 [21:33:48] 0.0.23 [21:33:51] see if it is ok [21:33:54] ottomata: thanks [21:33:58] adding the maxage parameter! actually pretty smart, i like that [21:34:08] nuria: weirdddd [21:34:10] looking at refinery-camus but nothing obviously wrong :/ [21:34:24] it's like the opposite of what people usually do (cache-bust) but I love it [21:34:27] makes a lot more sense [21:34:44] milimetric: super smart actually, i have never used that in any api i designed before [21:34:51] man, that should be in the HTTP spec :) [21:35:42] "just like oozie, we shouldn't change the jars out from under the thing we know works, even if we build a new version" [21:35:46] AMEN [21:36:00] ottomata: cause a new version might bring new dependencies [21:36:09] waiting for current camus run to finish...
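On the symlink point: a quick way to see which concrete build the unversioned "latest" jar currently resolves to (a sketch; the exact artifacts path is an assumption based on the paths mentioned around here):

    # where does the unversioned refinery-camus jar actually point?
    readlink -f /srv/deployment/analytics/refinery/artifacts/refinery-camus.jar
    ls -l /srv/deployment/analytics/refinery/artifacts/ | grep refinery-camus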
[21:36:27] nuria: the cron job in puppet is using the symlink to latest jar [21:36:29] tsk tsk on me [21:36:57] ottomata: the thing is that lates should not be build if there are no changes [21:37:07] indeed [21:37:12] ottomata: i do not think the symlink is the issue [21:37:14] so ja, this *shouldn't* break [21:37:15] but even so [21:37:18] ottomata: you would expect that [21:37:20] no its not the symlink [21:37:31] but, since the cronjob launches using the symlinked jar [21:37:33] if we deploy [21:37:40] it automatically will launch the next job with the latest jar [21:37:46] instead of the one it has always been working with [21:37:48] ottomata: no schema in /mnt/hdfs/wmf/refinery/current/artifacts/refinery-camus.jar [21:37:54] ! [21:37:58] that'll do it! [21:38:01] something busted with build process then [21:38:05] maybe a build problem, ebernhardson added a submodule [21:38:25] dcausse: i bet git init wasn't run, that is probably it [21:38:32] probably [21:38:37] i think that's it [21:38:38] i built [21:38:40] didn't have it cloned [21:38:41] bah! [21:38:53] so many moving pieces ... [21:38:56] we should make something in the build fail if the repo doesn't exist [21:39:01] yes :/ [21:39:05] it turns out there is a reason some people hate submodules :( [21:39:14] ottomata: YES YES [21:39:45] but the camus error message is pretty confusing :/ [21:40:12] indeed [21:41:05] hmm dcausse we should also probably change kafka.max.pull.minutes.per.task=55 [21:41:08] to 10 or something [21:41:17] do you read mine? [21:41:32] ? [21:41:37] or mediawiki.properties from prod? [21:41:49] mediawiki.properties [21:42:02] in puppet [21:42:05] ah ok [21:42:23] i'm scared to kill the yarn job, even though it should be fine...i'm just gonna wait [21:42:26] ottomata: I don't really what it is so feel free to change it [21:42:31] but i have to wait 55 minutes :) [21:42:34] :) [21:42:39] ah, its the length of time camus is allowed to run [21:42:42] if it reaches the end [21:42:50] of time [21:42:53] it'll stop reading from kafka and just write what it has [21:42:55] to hdfs [21:42:59] ok [21:43:12] the cron is set to run every 15 mins [21:43:16] so we should make it less than that [21:43:19] 13 or 14 minutes maybe [21:43:26] makes sense [21:43:34] that way, when there are problems, we don't have to wait the full amount of time before we can make a change [21:43:36] well [21:43:39] full 55 minutes [21:44:48] dcausse: , nuria, annoying thing about hardcoding libjars refinery-camus version [21:44:52] is that a schema change will now require a p uppet change [21:45:01] so the cron uses the new jar with the schema change [21:45:15] schema in the jar is not really ideal :( [21:45:18] but, i guess that is better than errors just cropping up when we aren't looking [21:45:20] yeah indeed [21:47:20] nuria: I don't see anything I'm logging being logged there [21:47:24] not sure what i'm missing [21:49:36] madhuvishy: in m4 log? [21:49:50] madhuvishy: ahh , did you build? like : [21:50:32] ~>/srv/deployment/eventlogging/eventlogging# python setup.py install [21:52:50] and after restart [21:54:34] nuria: gah - i've been changing in my home folder - i did do setup.py install but i know why [21:54:44] madhuvishy: k [21:54:54] madhuvishy: in beta I normally change the main thing [21:55:03] the main ..ahem .. checkout [22:03:51] nuria: yeah - i put some logging.error lines in jrm.py [22:04:08] nothing on logs though [22:05:01] madhuvishy: man ..... 
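And a sketch of the check that would have caught the schema-less deployed jar (the "no schema in refinery-camus.jar" discovery at 21:37): confirm the avro schemas made it into the artifact and that the schema submodule is populated before building; the .avsc packaging detail is inferred from the schema file referenced at 21:17:

    # list avro schema resources bundled in the deployed refinery-camus jar
    unzip -l /mnt/hdfs/wmf/refinery/current/artifacts/refinery-camus.jar | grep -i '\.avsc'
    # before building, make sure the schema submodule is actually checked out
    git submodule update --init --recursive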
[22:05:14] madhuvishy: gotta love the mysql side of eventlogging [22:05:23] nuria: dont know why nothing on logs - may be i'm looking at the wrong place [22:05:29] madhuvishy: where is your code? [22:05:56] nuria: /srv/deployment/eventlogging/ [22:06:46] nuria: /srv/deployment/eventlogging/eventlogging/eventlogging/jrm.py [22:06:54] madhuvishy: k, changing [22:06:59] may be i can add some to the consumer [22:08:15] nuria: i think i know why [22:08:31] madhuvishy: I was going to suggest "print" [22:08:34] rather than logger [22:08:52] nuria: it should be logger.error not logging.error [22:09:02] madhuvishy: k, i will let you change [22:12:48] dcausse: ok, starting short camus mediawiki run with 0.0.23 [22:13:45] oh, i lied before camus mediawiki runs every hour [22:13:50] hm, maybe we should shorten it [22:13:54] in another change :/ [22:13:55] meh [22:13:55] :) [22:17:51] ottomata: ok thanks :) [22:19:05] looks like it ran [22:19:08] hm [22:19:18] ottomata: if i put log statements in eventlogging/jrm.py where would it show up? [22:19:41] ottomata: no new data, strange :/ [22:19:43] how are you running eventlogging? [22:19:54] dcausse: i only ran it for 5 minutes [22:19:55] but yeah [22:19:55] hm [22:20:21] ottomata: in beta cluster - just changing inside /srv/deployment/eventlogging - setup.py install - and restarting [22:20:28] oh i see some errors [22:20:35] madhuvishy: shoudl be in /var/log/upstart then [22:20:53] oh those are fine errors [22:20:57] just topci not fully pulled [22:21:00] in eventlogging_consumer-mysql-m4-master.log? [22:21:08] 16/02/09 22:17:42 INFO kafka.CamusJob: HDFS: Number of bytes written: 15857 [22:21:10] probably madhuvishy [22:21:16] hmmmm [22:23:38] dcausse: i have to run :( [22:23:47] oooOOOO [22:23:48] hm [22:23:55] i will bring my computer with me and check in on this later [22:24:33] ottomata: no problem, going to sleep, send me a mail if you want me to do something tomorrow morning [22:24:58] ok [22:28:57] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013086 (mobrovac) Status: @Milimetric created a new source repository directly in Gerrit - [analytics/query-service](https://gerrit.wikimedia.org/r/#/admin/proj... [22:29:15] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013088 (mobrovac) [22:43:34] (PS2) Mobrovac: Removed RESTBase-specific code from AQS repository. [analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) (owner: Ppchelko) [22:44:16] (CR) Mobrovac: [C: 2 V: 2] Removed RESTBase-specific code from AQS repository. 
[analytics/query-service] - https://gerrit.wikimedia.org/r/269477 (https://phabricator.wikimedia.org/T126294) (owner: Ppchelko) [22:44:34] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013138 (mobrovac) [23:18:12] nuria: something is weirdddd [23:27:14] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013325 (mobrovac) [23:27:16] Analytics, Beta-Cluster-Infrastructure, Deployment-Systems, Services, and 3 others: Set up AQS in Beta - https://phabricator.wikimedia.org/T116206#2013324 (mobrovac) [23:28:15] Analytics, RESTBase, Services, HyperSwitch, Patch-For-Review: Separate AQS off of RESTBase - https://phabricator.wikimedia.org/T126294#2013332 (Pchelolo) [23:33:51] Analytics-EventLogging, scap, Scap3: Move EventLogging service to scap3 - https://phabricator.wikimedia.org/T118772#2013366 (greg)