[07:47:00] !log stopping kafka on kafka1020.eqiad and rebooting the host for Linux 4.4 upgrades
[07:47:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[07:51:06] elukey: joal: rate limits are being enforced now for pageview endpoints
[07:51:34] mobrovac: goooooooood
[07:51:35] Great mobrovac ! Thanks for having explained everything to Nuria :)
[07:51:36] thanks!
[07:52:27] joal: o/
[07:52:41] I am rebooting kafka1020 to install the 4.4 kernel
[07:52:41] \o elukey
[07:52:47] elukey: Just saw that :)
[07:52:51] let's see if my patch solves the mtime issue
[07:53:03] That'd be awesome
[07:54:20] !log EL restarted on eventlog1001
[07:54:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:18:46] joal: didn't see any "compaction" log in /var/log/kafka/log-cleaner.log after the restarts
[08:18:58] buuuut need to wait a couple of hours to check that we are good
[08:19:20] k elukey: looks like a victory, but let's wait for the referee count :)
[08:19:55] :)
[08:20:59] I also noticed something weird, namely that under-replicated partition errors stay up until I run a preferred replica election (I can also see holes in topics --describe)
[08:21:25] It may be due to the fact that it takes a bit to replicate everything, buuut better to double-check
[08:23:04] ls
[08:23:07] nope :)
[08:26:53] !log deleted very old kafka.log files in /var/log/kafka to free root space
[08:26:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[08:41:28] elukey: I messed up yesterday :(
[08:41:46] joal: aqs results?
[08:41:48] elukey: I hope HaeB will have some backup and will be able to save my life
[08:41:58] elukey: nope, the mobile apps uniques jobs
[08:42:37] elukey: Those jobs overwrite their data with an augmented version
[08:42:59] elukey: So it's strictly forbidden to run two instances of them in parallel ... Which I did :(
[08:43:26] elukey: 1 year and a half of data lost ;(
[08:46:29] HaeB: You're my savior, I owe you much !!!
[08:48:33] ouch :/
[08:49:11] it happens joal, fingers crossed for the backup!
[08:50:03] elukey: HaeB has it all ! pffff ... I breathe anew
[08:52:15] wooooooooooooo
[08:53:09] elukey: repairing my mess
[09:27:09] joal: kafka is still modifying mtime, I just checked the data logs on kafka1020
[09:27:14] sigh
[09:27:20] hmff :(
[09:27:24] all of them June 1
[09:39:36] elukey: but on the upside, the change to set the correct sysctl value for the kafka brokers works fine, kafka1020 uses a 512k connection tracking table
[09:42:13] moritzm: ah forgot to check that, good! :)
[09:47:03] moritzm, joal - https://issues.apache.org/jira/browse/KAFKA-1379
[09:47:22] somebody answered in kafka-users@
[09:47:38] I was in Cc but didn't get the answer
[09:54:24] opening a phab task to track everything.. :/
[10:03:32] Analytics, Analytics-Cluster: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690#2344329 (elukey)
[10:03:58] moritzm: --^ didn't put you in Cc but if you wish, here's the phab task
[10:08:28] thanks
[10:19:30] Analytics, Analytics-Cluster: Kafka 0.9's partitions rebalance causes data log mtime reset messing up with time based log retention - https://phabricator.wikimedia.org/T136690#2344390 (elukey)
[10:28:59] * elukey lunch!
[10:31:02] hi team! :]
[10:36:23] hi mforns :)
[10:36:31] hi joal!
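For reference, the under-replication check and the preferred replica election elukey describes can be reproduced with the stock Kafka 0.9 CLI roughly as below. This is a hedged sketch: the WMF `kafka` wrapper may spell these differently, and the zookeeper connect string is a placeholder.

```bash
# Show only partitions whose in-sync replica set is smaller than the full
# replica set -- the "holes" visible in a plain --describe.
kafka-topics.sh --zookeeper zk-placeholder.eqiad.wmnet:2181/kafka \
    --describe --under-replicated-partitions

# Hand leadership back to the preferred (first-listed) replica of every
# partition once the rebooted broker has caught up again.
kafka-preferred-replica-election.sh \
    --zookeeper zk-placeholder.eqiad.wmnet:2181/kafka
```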
[10:36:56] joal, is there anything I can help with in cluster ops today?
[10:38:12] mforns: I think we're back on track now that I've cleaned up my mess
[10:38:24] Thanks for asking :)
[10:39:12] cool, yea, I wouldn't have thought that having 2 instances of a job would have such consequences
[10:39:48] mforns: It doesn't necessarily happen if writing times don't overlap (by chance)
[10:39:55] I see...
[10:40:35] But the probability of having 2 oozie jobs in parallel (which means a lot of single jobs in parallel) and overlap not happening is small
[10:41:36] I've been lucky Tilman had a backup
[10:45:39] aha, everything is ok :]
[10:52:14] mforns: I did spend a few stressful minutes though :)
[10:52:34] mforns: And was lucky Tilman was still awake
[11:00:08] joal, hehehe I believe you
[11:13:10] mforns, elukey : ssh -N stat1002.eqiad.wmnet -L 9091:stat1002.eqiad.wmnet:9091
[11:13:21] mforns, elukey : then http://localhost:9091/caravel/dashboard/3/
[11:43:00] joal: here I am.. Caravel?
[11:44:57] mforns, joal : proceeding with another kafka reboot
[11:49:46] joal, what is the user and pass?
[11:49:52] admin
[11:49:53] elukey, ok
[11:50:16] thx
[11:51:32] !log rebooting kafka1022 for kernel upgrades
[11:51:34] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[11:55:34] !log restarted EL on eventlog1001
[11:55:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[11:57:09] joal, looks good!
[12:01:08] what is caravel?
[12:01:38] elukey: caravel is a dashboarding system built by airbnb I think
[12:01:45] elukey: originally on top of druid
[12:01:50] ahhh okok
[12:12:42] joal, our sunbursts are better :P
[12:12:48] mforns: true !
[12:13:23] mforns: tailored is usually better than generic :)
[12:13:33] joal, of course
[12:15:10] joal, milimetric, I loaded the user_state intermediate table, but there are problems with the user_id refs...
[12:17:41] mforns: can I help?
[12:18:01] joal, do you want to batcave for a while?
[12:19:45] sure mforns
[12:19:52] OMW
[12:53:17] joal, can I remove the exec(...) from the file or do you want me to comment it out?
[12:53:29] as you want :)
[12:53:32] you can remove it
[12:53:38] ok
[13:41:58] moorning
[13:42:03] elukey: o/
[13:42:06] look i did the wave! ^
[13:47:46] ottomata: o/
[13:47:48] the wave??
[14:02:19] elukey: https://phabricator.wikimedia.org/T126629#2344761
[14:02:30] elukey: not sure how closely you're following the 2.2 stuff
[14:03:08] elukey: also, i wouldn't mind a second pair of eyes (if you find yourself curious :))
[14:06:00] urandom: wow, will check later on, thanks!
[14:07:30] elukey: when you got a sec let's talk about kafka restarts and mtime and eventlogging
[14:09:36] ottomata: sure!
[14:10:11] I guess you saw https://phabricator.wikimedia.org/T136690 right?
[14:10:47] ja
[14:10:50] so, delete policy didn't help?
[14:11:27] [PoC][WiP] Replace pykafka with latest (and snazzy!) kafka-python
[14:11:28] hahahahah
[14:11:34] (brb! ) :) hehehe
[14:11:52] no help from delete policy :(
[14:16:43] rats
[14:17:10] elukey: do you have more kafka restarts to do?
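Since the delete policy didn't help with T136690, a quick way to spot-check whether a restart reset the data-log mtimes is to bucket segment files by modification date. The data directory below is an assumption; the broker's real `log.dirs` value is what matters.

```bash
# Tally Kafka log segments per modification date. After a rebalance-triggered
# reset, nearly every segment shows the same (current) date -- e.g. June 1 --
# which silently defeats time-based retention.
find /var/spool/kafka -name '*.log' -printf '%TY-%Tm-%Td\n' \
    | sort | uniq -c | sort -rn
```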
[14:19:19] ottomata: 3 more, I was thinking of doing them today
[14:19:27] in order to align the purge day
[14:19:33] aye
[14:19:33] more or less
[14:19:37] ok
[14:19:47] now that you mention it, I'll do another one :P
[14:19:51] hang on
[14:19:59] i kinda want to stand up my own el consumers
[14:20:03] using current code with pykafka
[14:20:07] and my wip code
[14:20:12] to see if i can reproduce in prod maybe
[14:20:18] cause thus far I can't reproduce anywhere
[14:20:24] ah ok I'll wait for your green light
[14:20:30] ok, can you give me an hour?
[14:20:50] think i can be ready in an hour
[14:20:52] yep sure, I am implementing the begin:|end: prefix for vk
[14:20:59] k awesome!
[14:21:04] I found a way
[14:21:06] :)
[14:21:37] elukey: i'm all for using the end of the varnish tag, i think that will at the least fix most problems...but would it be less error prone to use some timestamp at seq creation time?
[14:23:42] ottomata: the main problem would be that the timestamp would not come from varnish tags but from vk timings
[14:23:53] that shouldn't change much
[14:24:04] but I'd prefer to keep things separated
[14:24:34] maybe I can add another timestamp
[14:24:43] for vk timings?
[14:24:49] aye, not sure which is better
[14:24:52] def using end from varnish is cleaner
[14:25:06] but using seq time from vk would be better for our stuff
[14:25:07] :/
[14:25:12] elukey: let's just do what you are doing now
[14:25:14] with end time
[14:25:17] and keep the other thing in mind
[14:27:06] ottomata: Giuseppe updated his key, a "gpg2 --refresh-keys" should make pwstore committable again
[14:27:37] oook
[14:29:35] ottomata: yeah maybe one thing at a time, if this change doesn't work I'll introduce seq timings too
[14:30:13] moritzm: i still get Warning: the following recipients are invalid: 7EABB4778EA5218DF66D9D2FA24CD0703BB997C5.
[14:30:23] does .users need updating?
[14:30:38] elukey: cool sounds good
[14:31:17] nope, same key ID, can you send me the output of " gpg2 --list-keys gvalagetto"?
[14:31:23] nope, same key ID, can you send me the output of " gpg2 --list-keys glavagetto"?
[14:32:37] moritzm: gpg: error reading key: No public key
[14:32:46] i might have deleted that key id when i was trying to figure this out myself
[14:33:31] can you try "gpg2 --recv-key 7EABB4778EA5218DF66D9D2FA24CD0703BB997C5"?
[14:33:47] looking good...
[14:33:59] i was trying search-keys
[14:34:28] hmm, pwstore still says key invalid
[14:34:32] am trying refresh again
[14:34:43] nope no good
[14:34:49] oh
[14:34:51] it's a new key
[14:34:52] Warning: the following recipients are invalid: 5FF346D51268D1468A070853AB640DA3D40B305A.
[14:34:53] what do you now get for "gpg2 --list-keys gvalagetto" ?
[14:35:04] akosiaris = 5FF346D51268D1468A070853AB640DA3D40B305A
[14:35:13] gpg2 --list-keys gvalagetto
[14:35:13] gpg: error reading key: No public key
[14:35:30] gpg2 --list-keys glavagetto@wikimedia.org works though
[14:35:32] moritzm:
[14:35:36] pub 4096R/3BB997C5 2015-04-30 [expires: 2018-05-17]
[14:35:37] uid Giuseppe Lavagetto (WMF)
[14:35:37] sub 4096R/BD0A0085 2015-04-30 [expires: 2018-05-17]
[14:35:50] and do you have Alex's key?
[14:36:03] i.e. gpg2 --list-keys akosiaris@wikimedia.org ?
[14:36:29] ha nope
[14:36:30] ok...
[14:36:43] recving
[14:37:11] man, new key... ok, trying for each i run into ...
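The pwstore key dance above boils down to three gpg2 invocations; the fingerprint is the one moritzm quotes, and depending on the local gpg.conf a --keyserver option may also be needed.

```bash
gpg2 --refresh-keys                 # re-fetch updated copies of keys already in the keyring
gpg2 --recv-key 7EABB4778EA5218DF66D9D2FA24CD0703BB997C5   # fetch a key not yet in the keyring
gpg2 --list-keys glavagetto@wikimedia.org                  # confirm the uid and expiry look sane
```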
[14:38:07] I'm afk for about 1.5 hrs, let me know if it still fails later
[14:38:10] k
[14:38:45] ah, moritzm, apergos's is expired too
[14:45:27] mforns: hi
[15:10:01] elukey: with your ops super powas, can you copy files from stat1002 to aqs1004?
[15:15:03] joal: in a meeting currently, brb in 15 mins!
[15:15:09] np elukey
[15:25:03] woowee stat1002 is bussayyyyy
[15:28:36] joal: here I ammmm
[15:29:16] so for the files we could use netcat, but how big is the amount of things to copy joal?
[15:36:10] elukey: checking
[15:36:48] elukey: 125M
[15:37:29] elukey: this is the new aqs request set, to continue testing on 2 months of data
[15:45:53] ottomata: any idea on copying data from stat1004 to aqs1004 ?
[15:46:44] joal: then we can just tar.gz the files and do cat file.tar.gz | nc aqs1004.eqiad.wmnet 6666
[15:46:48] on stat1002/4
[15:46:50] joal: ja rsync daemon
[15:46:53] and on aqs1004
[15:47:10] netcat -l -p 6666 > file.tar.gz
[15:47:17] super simple :)
[15:47:29] this only if port 6666 is open of course
[15:47:43] elukey: I think it's not :)
[15:48:04] it's not, but, make an unprivileged rsync daemon run on stat1004
[15:48:06] elukey: ACLs are kept clean in our subnetworks ;)
[15:48:13] and then rsync pull from aqs
[15:48:25] wow ottomata
[15:48:34] * joal not knows about rsync daemons
[15:48:44] lemme try, then i show you :)
[15:49:07] joal: where is the data?
[15:49:09] on stat1004?
[15:49:17] stat1004 (HDFS, but will get it out)
[15:49:27] aye ok
[15:49:48] joal: we should be able to reach aqs1004 from stat without issues, the network ACLs should allow it (and rsync would be subject to the same problem in that case)
[15:49:50] heh might be able to rsync from hdfs mount !
[15:50:06] hmm
[15:51:43] elukey: netcat doesn't manage to connect I think
[15:51:48] elukey: batcave for a minute?
[15:52:35] joal: going to try now, but I'd trust ottomata more, at this time he has probably already set up rsync :D
[15:54:26] heheh, ya but now that i've said it...i'm not sure i've ever done it non-privileged...
[15:54:43] def can do it with sudo...
[15:55:51] mwarf
[15:56:19] ottomata, elukey : for a smaller amount of data, I'd copy it on my computer, but 120M is rather biggy to cross the atlantic !
[15:58:03] yes got it!
[15:58:09] ok joal, elukey, check it
[15:58:11] on stat1004
[15:58:15] cat /home/otto/rsyncd.conf
[15:58:26] then
[15:58:45] rsync --daemon --no-detach --port 7654 --config /home/otto/rsyncd.conf
[15:58:49] then on aqs1001
[15:59:00] trying ottomata
[15:59:06] rsync --port 7654 -rvn stat1004.eqiad.wmnet::hdfs/user/otto/test1/ ./test1/
[15:59:11] your rsync daemon is running ?
[15:59:16] just killed it
[15:59:18] k
[15:59:24] launching it
[15:59:25] (test1 is just some random dir i had in my hdfs homedir)
[16:01:23] a-team: standddduppp
[16:01:37] ottomata: worked like a charm !
[16:01:43] Thanks a lot :)
[16:02:09] let's save the magic formula somewhere!
[16:02:41] I have some trouble connecting to the batcave :(
[16:03:34] elukey: can you try again?
[16:04:08] nuria_: yeah I am doing it, currently restarting everything
[16:05:36] madhuvishy: are you having problems joining too?
[16:05:56] nuria_: I keep seeing "please wait" before joining
[16:06:00] not sure why
[16:06:06] have restarted chrome a couple of times
[16:06:16] elukey: are you logged in with a non-wikimedia account?
[16:06:22] nope :)
[16:06:54] a-team, hangouts logged me out and does not let me in...
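The contents of /home/otto/rsyncd.conf are never shown in the log, but a minimal unprivileged config matching the commands above would look roughly like this. The module name must match the `::hdfs` in the pull command; /mnt/hdfs (the fuse mount) is an assumption.

```bash
# Recreate a minimal unprivileged rsync daemon config and start it.
cat > ~/rsyncd.conf <<'EOF'
use chroot = no
pid file = /tmp/rsyncd.pid

[hdfs]
    path = /mnt/hdfs
    read only = yes
EOF
# chroot needs root, hence "use chroot = no"; "read only = yes" keeps the
# one-off copy safe since nothing should be written back.
rsync --daemon --no-detach --port 7654 --config ~/rsyncd.conf
```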
[16:07:01] :(
[16:07:22] mforns: try to log in with an incognito session
[16:07:39] I'm in
[16:09:27] (PS3) Milimetric: Moves other beta feature reports where they belong [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/289004 (https://phabricator.wikimedia.org/T126549)
[16:11:03] joal yaa
[16:11:06] Y
[16:11:34] ottomata1: I copied your rsyncd.conf file and commented it with the commands ;)
[16:11:39] Thanks again :)
[16:20:23] yup!
[16:20:34] elukey: ok, btw, got el processors running on stat1002
[16:20:41] woa nice!
[16:20:44] 2 running pykafka, and 2 running kafka-python
[16:20:45] will reboot after standup
[16:20:52] each in their own consumer groups
[16:21:31] I will start using "snazzy" more often now
[16:21:35] haha ok
[16:21:37] :D
[16:41:22] elukey: goals?
[16:41:27] yep coming
[17:27:33] Analytics-Kanban, Patch-For-Review: Event Logging doesn't handle kafka nodes restart cleanly - https://phabricator.wikimedia.org/T133779#2345531 (Ottomata) a:Ottomata
[17:33:28] ummmm
[17:33:34] maybe i will just eat at this cafe i'm going to in a sec
[17:33:41] milimetric: then i can work with you sooner on your limn puppet stuff
[17:33:42] elukey:
[17:33:50] ok
[17:33:53] will you do a broker restart real quick before I go?
[17:33:54] i'm in sos for 30
[17:34:05] milimetric, I will deploy dashiki changes, but is it ok to do it in a couple of hours?
[17:34:16] btw, apparently sweet websocket server: https://github.com/alexhultman/uWebSockets
[17:34:28] mforns: anytime, we should catch up on the prelim tables too
[17:34:31] I have questions
[17:34:34] but now SoS
[17:34:38] ooo cool milimetric
[17:34:45] milimetric, I can do it by myself, I think I'll just need the +2 on the puppet patch when it's time cc: ottomata
[17:34:57] sure
[17:35:02] ottomata: are you going afk?
[17:35:15] elukey: was going to, but i can wait to just watch
[17:35:16] because I am too, and I don't like rebooting a node and not paying attention to it :D
[17:35:20] ooooooook
[17:35:20] ahh okok
[17:35:23] i guess i can do it
[17:35:27] which one do you need to do next?
[17:35:28] nono I can do it
[17:35:32] well, i mean, i can do it later
[17:35:37] and go get lunch now
[17:35:38] ahh okok
[17:35:43] i'd just do one so I can watch!
[17:35:45] lemme grab the nodes
[17:35:49] ja which remain?
[17:35:55] i'll pick one or two and do, and then let you know
[17:35:58] milimetric: Can you share the json you used for loading druid?
[17:36:02] Analytics-Kanban, Patch-For-Review: Browser report on analytics.wikimedia.org has broken icons - https://phabricator.wikimedia.org/T136217#2345555 (Krinkle) Open>Resolved
[17:36:06] Analytics-Kanban, Patch-For-Review: Browser report on analytics.wikimedia.org has broken icons - https://phabricator.wikimedia.org/T136217#2327028 (Krinkle) Thanks!
[17:36:16] kafka1018.eqiad.wmnet kafka1012.eqiad.wmnet kafka1014.eqiad.wmnet
[17:36:24] linux 4.4 is already there
[17:36:43] elukey: oh they need to be fully rebooted?
[17:36:47] those 3?
[17:37:45] yep!
[17:37:50] yeah, we need to migrate the 3.19 systems to 4.4 since support for 3.19 ends in about six weeks
[17:38:43] joal: stat1002.eqiad.wmnet:/home/milimetric/index-pageviews.json
[17:38:54] thx milimetric, will have a look
[17:45:39] ottomata: first vk draft available - https://gerrit.wikimedia.org/r/#/c/292172/
[17:45:59] still need to figure out if those strstr calls will produce a perf hit
[17:46:08] don't think so but I'd like to measure it :)
[17:46:18] let me know if you see something that you don't like
[17:47:29] we use wheels to deploy?
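To compare how the two pykafka and two kafka-python test processors ride out a broker restart, their group offsets can be watched with the stock Kafka 0.9 tooling. The group name below is invented for illustration; the real ones are whatever ottomata configured.

```bash
# Describe one consumer group: per-partition current offset, log-end offset
# and lag. A healthy group keeps committing offsets through the restart.
kafka-consumer-groups.sh --new-consumer \
    --bootstrap-server kafka1012.eqiad.wmnet:9092 \
    --describe --group eventlogging_processor_kafka_python_test
```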
[17:47:33] what do we deploy with wheels?
[17:51:05] elukey: ok cool will try to find some time to look at it
[17:51:09] btw decided not to go to the cafe
[17:51:14] am staying home til i have to go to my apt.
[17:53:08] going afk team! have a good evening/afternoon :)
[17:55:43] bye elukey :)
[17:56:33] laters!
[18:02:14] cd coreyfloyd
[18:02:18] oops :)
[18:02:20] sorry
[18:16:42] !log stopping kafka broker on kafka1018 and rebooting node
[18:16:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master
[18:25:57] Analytics-Kanban, Patch-For-Review: Event Logging doesn't handle kafka nodes restart cleanly - https://phabricator.wikimedia.org/T133779#2345713 (Ottomata) Tested using kafka-python in prod today while restarting a broker. The pykafka processes died with ``` 2016-06-01 18:17:09,165 (Thread-19 ) Updatin...
[18:32:17] milimetric: can help if you wanna
[18:33:41] with the deploy?
[18:41:06] ottomata: you wanna do the deploy now?
[18:41:14] is that what you meant by help?
[18:43:32] I'm gonna go grab lunch, sorry, starving
[18:44:11] uh, with limn something?
[18:44:12] sure
[18:59:06] Analytics: Puppetize job that saves old versions of geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (Nuria)
[19:32:32] (CR) Nuria: [C: 2 V: 2] "My BAD for not mentioning the monthly job in the prior CR." [analytics/refinery] - https://gerrit.wikimedia.org/r/291989 (owner: Joal)
[20:11:31] a-team, I'm gone for tonight !
[20:55:37] mforns: hi
[20:56:28] (CR) Milimetric: [C: 2 V: 2] Create ee-beta-features directory [analytics/limn-ee-data] - https://gerrit.wikimedia.org/r/289006 (https://phabricator.wikimedia.org/T126549) (owner: Milimetric)
[20:56:34] (CR) Milimetric: [C: 2 V: 2] Create flow-beta-features directory [analytics/limn-flow-data] - https://gerrit.wikimedia.org/r/289005 (https://phabricator.wikimedia.org/T126549) (owner: Milimetric)
[20:56:44] (CR) Milimetric: [C: 2 V: 2] Moves other beta feature reports where they belong [analytics/limn-language-data] - https://gerrit.wikimedia.org/r/289004 (https://phabricator.wikimedia.org/T126549) (owner: Milimetric)
[20:58:40] I merged all those repos because I think it's pretty safe
[20:59:05] basically it'll stop the updates from happening on those 2 reports until andrew merges the puppet
[20:59:14] but when it runs again it'll just read the output file and be fine
[21:03:48] oops, all reportupdater jobs didn't like that autocommit flag:
[21:03:50] 2016-06-01 21:00:10,678 - ERROR - Report "content_translation_beta_manual" could not be executed because of error: MySQLdb can not connect to database ('autocommit' is an invalid keyword argument for this function).
[21:14:04] Analytics, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2346574 (Milimetric) Resolved>Open Since this was merged, all reportupdater jobs seem to fail with an error like: ```ERROR - Report "cont...
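The autocommit error milimetric pastes is a driver-version problem: as the phab update below confirms, MySQL_python 1.2.3 predates the `autocommit` connect keyword that 1.2.5 accepts. A quick diagnosis looks like this (upgrading is shown only as the alternative; the fix actually chosen was reverting the patch):

```bash
# Print the installed MySQLdb driver version; 1.2.3 lacks the kwarg.
python -c 'import MySQLdb; print(MySQLdb.__version__)'

# The alternative fix, had upgrading been preferred over the revert:
pip install --upgrade 'MySQL-python>=1.2.5'
```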
[21:14:31] Analytics-Kanban, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2346579 (Milimetric) a:Neil_P._Quinn_WMF>Milimetric
[21:17:58] (PS1) Milimetric: Revert "Make SQL connections use autocommit" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/292266
[21:18:08] (CR) Milimetric: [C: 2 V: 2] Revert "Make SQL connections use autocommit" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/292266 (owner: Milimetric)
[21:20:11] milimetric, hi!
[21:20:32] hey mforns, if you're done for the day don't worry about all those pings, I got it
[21:20:44] no I'm still here for a while
[21:20:45] hey btw, I don't know if you saw my chat message in our chess game, good game, well played :)
[21:21:00] you discovered my secret weakness:
[21:21:03] oh! I didn't see it
[21:21:13] XD what weakness?
[21:21:21] if you make me play long games you'll kick my ass
[21:22:00] ok, so if you're around though we can look at this together and catch up on the user data stuff
[21:22:03] batcave?
[21:22:06] sure!
[21:22:14] give me 1 min
[21:41:09] Analytics-Kanban, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2346819 (Milimetric) Yep, we have MySQL_python==1.2.3 on stat1003 and we need 1.2.5 (confirmed on my local that we get the problem with 1.2...
[22:17:36] Analytics-Kanban, RESTBase, Services, RESTBase-API, User-mobrovac: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2347369 (Nuria) @mobrovac : where can we see how often we are running into throttling limits? The dashboard seems pretty empty since this morning...
[22:21:42] Analytics, Pageviews-API, Services: Rate limiting breached should be logged also when throttling is enabled - https://phabricator.wikimedia.org/T136769#2347385 (Nuria)
[22:28:01] Analytics-Kanban, RESTBase, Services, RESTBase-API, User-mobrovac: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2347400 (Nuria) Never mind, throttling is no longer being logged; it should be, as we need to know how often it happens, thus I have filed a bug o
[23:46:02] Analytics, Pageviews-API, Services: Rate limiting breached should be logged also when throttling is enabled - https://phabricator.wikimedia.org/T136769#2347373 (GWicke) PR @https://github.com/wikimedia/hyperswitch/pull/45
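As a footnote to the throttling-visibility tasks, the limiter can at least be observed from the client side. This probe uses the public REST v1 pageview URL shape; the burst size needed to trip the limit is an assumption, since the actual thresholds live in the hyperswitch configuration.

```bash
# Fire a burst at one article's daily pageviews and tally status codes;
# once the per-client budget is exhausted, 429s should start appearing.
url='https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2016050100/2016053100'
for i in $(seq 1 200); do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
done | sort | uniq -c
```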