[00:43:01] kjschiroo, not sure. Worst case, you can always $ hdfs dfs -cat | openssl dgst -sha1
[00:47:51] I realized that hdfs must be doing something unexpected behind the scenes. If the dump collection process dies, how can you just restart it with minimal impact?
[14:12:32] o/
[15:07:33] * halfak curses at bluejeans
[15:07:50] I missed my goal :(
[15:31:26] halfak, I retract my previous statement, it has horrible impact.
[15:31:56] Damn. File a big and then maybe a pull request?
[15:32:03] * halfak adds kjschiroo to the project
[15:32:50] File a big?
[15:32:52] oh bug?
[15:32:55] *bug
[15:34:27] Yeah, I was trying to figure out the solution right now. The problem I have right now is that I don't know whether to trust the dumps that are on there. I don't know which ones were in progress when it died, and whether they went away or were removed since they didn't complete.
[15:34:57] I was trying to figure out how to get a checksum, and have one way, but it looks like it is an md5.
[15:35:07] Which appears to be different from an md5sum
[15:35:23] joal and I are in a meeting for the next 45 minutes, but when we're done, I think he may have ideas.
[15:35:48] md5 and md5sum should produce the same value.
[15:36:43] kjschiroo: currently improving our scripts to include md5 checking
[15:36:51] more info after meeting
[15:41:56] joal: I was working on the same goal last night. I would be interested in seeing how you are doing it. I ran into the issue that hdfs_client.checksum(f_path) returns a checksum like 0000020000000000000800000abcdff2c49d52a0ed399037fbc0eaa500000000 when I expect something like this 0a5d50262a82efca0b0cf13cb7452d93. I must be missing something.
[16:27:49] kjschiroo: Hi again
[16:28:22] kjschiroo: hadoop checksums are different from linux ones (based on blocks)
[16:28:40] kjschiroo: The way I have found is to compute md5 using python
[16:29:57] I was looking into that, didn't finish it yet though. How close are you to pushing it out?
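(Editor's note: the exchange above turns on HDFS's checksum API returning a block-based composite checksum rather than a plain file md5, which is why `hdfs_client.checksum(f_path)` does not match `md5sum` output. A minimal sketch of the workaround the chat describes — streaming the file out of HDFS and hashing it locally — might look like the following; the `hdfs_md5` helper and its path argument are hypothetical, not joal's actual script.)

```python
import hashlib
import subprocess

def md5_of_chunks(chunks):
    """Compute an md5sum-style hex digest over an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

def hdfs_md5(hdfs_path):
    """Stream a file out of HDFS via `hdfs dfs -cat` and md5 it locally.

    Hypothetical helper: streams in 1 MiB chunks so the whole file never
    sits in memory, matching the `hdfs dfs -cat | md5sum` idea from the chat.
    """
    proc = subprocess.Popen(["hdfs", "dfs", "-cat", hdfs_path],
                            stdout=subprocess.PIPE)
    try:
        return md5_of_chunks(iter(lambda: proc.stdout.read(1 << 20), b""))
    finally:
        proc.stdout.close()
        proc.wait()
```

Because the hashing is done over the raw bytes, the result is directly comparable to an `md5sum` computed on the original file before it was uploaded, unlike the block-composite value HDFS reports.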
[16:30:07] kjschiroo: currently testing
[16:30:20] kjschiroo: should be out later on today (I hope)
[16:31:24] joal: This will only be for doing the checksum and validating the dumps and will not include the ability to restart them?
[16:32:08] kjschiroo: actually including the md5 thing to download, so restart will be built in
[16:32:52] Okay. Then I guess I will be looking forward to seeing it released :)
[19:01:20] kjschiroo: I'm sorry, my debugging takes me a little more time than expected :(
[19:01:45] kjschiroo: I'm logging off for tonight, will normally push code tomorrow during my day
[19:01:53] Have a good evening kjschiroo
[19:01:58] np :)
[19:01:59] You too
[20:04:40] Just joined documentation time, but I've got to let the dog out quick.
[20:04:42] So BRB
[20:04:59] dogcumentation
[20:16:33] Hmm... Looks like Thursday documentation times are hard for others too.
[20:16:41] I'm the only one here so far.
[20:17:24] I think everybody is distracted watching a video about galaxies or something
[20:17:39] (see -office)
[20:23:04] * halfak listens to galaxy zoo talk.
[20:23:19] I don't think that crowd-sourcing is "inexpensive"
[20:23:32] Wikipedia is very expensive -- we just don't pay for it in $$.
[20:23:43] We pay for it in volunteer time and attention.
[21:18:57] Report! https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2016-04-14
[22:19:55] halfak: I tried doing worklogs this way, btw - http://paws-public.wmflabs.org/paws-public/User:YuviPanda/worklog/Untitled.ipynb
[23:17:53] YuviPanda, I like the outline
[23:18:08] Seems like good discipline