[04:29:10] https://meta.wikimedia.org/wiki/User:Yurik/US_Politics_Real_Time [08:01:23] (PS3) Mforns: Improve the format of the browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) [08:12:43] (PS4) Mforns: Improve the format of the browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) [10:08:24] joal: helloooo [10:09:29] Hi :) [10:10:55] I am going to reboot kafka1014 [10:11:09] Perfect [10:11:10] EL should be super fine given the fact that we have the new patch [10:11:22] but.. if you could check it would be awesome :) [10:11:24] I'll monitor both EL and camus [10:11:29] :) [10:11:30] \o/ [10:13:52] ahhh we also have https://logstash.wikimedia.org/#/dashboard/elasticsearch/eventlogging-errors [10:15:29] elukey: weird thing: yesterday, at kafka reboot time, all oozie load job failed ... [10:15:44] * joal scratches his head in wonder [10:16:07] mmmm [10:16:10] kafka1014 stopped [10:16:51] elukey: log ! [10:18:02] !log rebooted kafka1014 for maintenance [10:18:09] Thanks elukey :) [10:18:22] elukey: From what I see, EL didn't break ! [10:18:26] Hurrqay ! [10:18:31] woooooowww [10:18:51] I'll wait a few minutes just to be sure, but lo9oks ok :) [10:19:02] One more problem less [10:22:49] elukey: ok, oozie jobs not related to kafka reboot (but to an1027 overloaded) [10:23:24] mediawiki/hhvm seems good [10:23:29] cool [10:25:49] elukey: Can you restart hue on an10267 please? [10:26:02] It kills the machine (therefore oozie is felling bad, etc) [10:26:06] !log ran kafka preferred-replica-election after the reboot [10:26:15] htop on an1027 will show you [10:26:29] sure [10:26:56] elukey: EL updated its kafka without failure --> WIM ! [10:27:36] elukey: camus, same :) [10:27:43] gooood [10:27:56] elukey: Looks like we can actually restart a kafka machine without breaking stuff ! YAY ! [10:27:59] mmm I am seeing that oozie is clogging 1027 though [10:29:13] oozie is partt of it yes, but when I asked you, hue (tomcat) was using 400% CPU [10:29:31] ottomata knows, we need to change that machine [10:30:28] all right hue restarted as requested :) [10:32:39] Thanks you elukey [10:32:50] Burrow alerted about the EL ERR status [10:33:00] but I guess it was only temporarly [10:33:14] I hope so :) [10:33:47] From the charts, me error :) [10:34:18] https://grafana.wikimedia.org/dashboard/db/eventlogging looks much better than yesterday :) [10:34:32] It dies :) [10:34:41] It DOES sorry :) [10:34:50] Man that typo is terrible :) [10:35:22] ahahhaahah [10:40:42] elukey: everythimn [10:41:02] elukey: everything looks fine, if you want you can proceed with other brokers [10:43:40] proceeding with 1018 [10:45:59] awesome [10:52:36] !log rebooted kafka1018 for maintenance [11:05:29] elukey: can you confirm there was a global restart of varnish yesterday during hour 14 UTC ? [11:06:10] I don't know but it might be a good question for the traffic channel [11:06:19] kafka1018 has some trouble coming up [11:06:27] Shit :( [11:07:12] elukey: hints on the reasons or not even ? [11:08:11] the management console is not reachable too [11:08:18] so it is a bit a mistery [11:10:03] Man, feels like 1012, no ? [11:11:37] so ops said that it might be the machine completely fried [11:11:39] :D [11:11:55] so I am going to launch a leader election just in case [11:12:03] Ah man ... 
We shouldn't have talked about those machines waranty period ;) [11:12:05] and remove 1018 from mediawiki-config [11:17:44] !log issued a kafka preferred-replica-election [11:31:26] joal: I saw a spike in errors in logstash for EL, but the dashboard looks good now [11:33:34] elukey: I can't say if the errors are related, but I don't see issues on python logs [12:09:09] joal: kafka1018 is up noww [12:10:57] great elukey ! [12:11:00] What was it ? [12:11:38] wrong fstab, like for 1012.. [12:11:54] rhmmm. Would have guessed it [12:12:19] reeeeally weird [12:12:24] anyhow, now it works [12:12:25] seems so yeah ! [12:12:30] You rock man :) [12:13:16] From what I see in EL logs, everything is smooth (normal errors on metadata change, renewing metadata, back in track --> what we expect :) [12:13:27] Same for camus [12:13:28] !log launched Kafka preferred-replica-election [12:13:55] joal: I wish! But thanks! :) [12:14:44] elukey: You HAVE TO accept some compliements :) [12:16:35] joal: from you guys always! [12:22:45] (CR) Joal: [C: 2 V: 2] "Should have been merged at revert time." [analytics/refinery] - https://gerrit.wikimedia.org/r/271293 (owner: Nuria) [12:25:20] all right, going ahead with the reboots [12:25:23] two more to go [12:25:50] elukey: Can you confirm yesterday there have been a varnish-kafka reboot around 14UTC? [12:25:56] elukey: sure, reboot ! [12:28:55] 14:02 bblack: package upgrades on cp* commence - https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:21] most probably for the glib bug [12:30:32] great elukey :) [12:30:32] yeah, these were for the glibc update [12:30:51] elukey: this was on ops channel ? [12:31:12] Oh, admin log, I see ! [12:31:15] Thanks :) [12:31:36] I don't have the reflect to look at ops admin log [12:31:40] I should :) [12:33:40] I just got a question if there's a simple way to find out which Wikipedia articles/pages have the most visits from Sweden, regardless of language version. Is there? [12:36:45] JohanJ: You can do that in hive using projectview tab [12:38:43] elukey: I can tell you just changed soemthing :) [12:39:28] 1020 stopped :D [12:39:33] hehe [12:39:43] EL spits some logs :) [12:41:26] !log kafka1020 rebooted [12:45:16] kafka1020 is up and running [12:45:18] better this time [12:45:30] Not each of them, cool ! [12:46:04] !log launched preferred-replica-election [12:46:44] joal: thanks! [12:52:14] np JohanJ :) [12:52:42] JohanJ: let us know if you need query review :) [12:53:05] has the deprecation of the mobile partition happened yet? (updating some scripts) [12:53:19] Ironholds: it has, in mif-jan [12:53:22] mid-jan [12:53:44] joal, great! You're the best :D [12:54:00] Ironholds: Trying my best to be better, for sure :) [12:54:43] I mean, isn't that the best anyone can do? :D [12:54:47] Who am I better than? I'm better than I used to be/I'mma keep on getting better so you better just get used to me [12:55:13] hehe, I'll never get used to you, you change all the time (for the best ;) [12:55:32] * Ironholds blushes [12:55:37] !log rebooted kafka1022.eqiad.wmnet [12:59:49] 1022 up and running! kafka reboots completed moritzm :) [13:03:44] !log executed kafka preferred-replica-election [13:04:30] elukey: YOU WIN !! [13:04:47] elukey: no EL downtime today ! [13:05:00] elukey: no camus downtime (hopefully, deopending on an1027) [13:05:08] Yay ! [13:05:29] joal: we also need to restart all the hadoop services on analytics* today... ;) [13:06:01] elukey: Reaaaaaaaaaally ? [13:07:28] yep.. glibc upgrade.. 
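[Editor's note] JohanJ asks above which Wikipedia articles get the most visits from Sweden regardless of language version, and joal points at Hive; a per-article breakdown needs the pageview-level table rather than the project-level one. A minimal sketch of that query, run the way queries are run elsewhere in this log (hive CLI redirected to a TSV). The table and column names (wmf.pageview_hourly, country_code, page_title, agent_type, view_count) are assumptions about the cluster schema, so treat this as a starting point for the query review joal offers below, not a finished report:

    # Hypothetical sketch: top articles viewed from Sweden across all projects
    # for a single day. Adjust the partition values and LIMIT as needed; the
    # schema names are assumed, not taken from this conversation.
    hive -e "
      SELECT project, page_title, SUM(view_count) AS views
      FROM wmf.pageview_hourly
      WHERE year = 2016 AND month = 2 AND day = 17
        AND country_code = 'SE'
        AND agent_type = 'user'
      GROUP BY project, page_title
      ORDER BY views DESC
      LIMIT 100;
    " > top_articles_sweden.tsv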
[13:07:30] https://grafana.wikimedia.org/dashboard/db/server-board?from=1455714421750&to=1455800581750&var-server=analytics1027&var-network=eth0 [13:07:39] rhm [13:07:42] Ok then :) [13:08:16] elukey: We'll be clever, and stop camus while restarting resource-manager and namenode [13:08:28] something is using a lot more RAM than before, ending up in swapping probably [13:10:17] atm is fine though (on the box) [13:10:28] ok elukey [13:10:35] I don't understand :( [13:11:45] ah the OOM killer has been invoked, I think it was pissed off by the noise :P [13:12:43] commuting to the office, then I'll restart the investigation. brb 30 min [13:31:07] a-team, leaving for a moment, trying to clear my head of hive weirdness [13:38:18] :) good luck [13:42:59] Days like that: Hey, it's sunny, let's get out ! And then it starts raining ... [13:43:31] milimetric: Hi ! [13:43:53] it only starts raining above your head too, follows you around [13:43:54] :) [14:08:16] joal, so when you get back, I'm encountering some serious hive weirdness [14:09:07] a simple query produces about 200 messages along the lines of "Feb 18, 2016 1:49:48 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728" and totally messes up the table formatting [14:09:28] it makes it impossible to read in the query results as the TSV they are :/ [14:10:07] is there any way I can disable these messages, or fix whatever the underlying problem is? [14:16:43] a-team, I'm on-call and I'm not sure what to do about Burrow alerts [14:16:54] can someone tell me so I can document it here: https://wikitech.wikimedia.org/wiki/Analytics/Oncall [14:17:06] oh, duh, it's here: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall [14:17:38] * milimetric 's brain sometimes only processes something when he asks questions [14:18:45] elukey: I love this so much: https://wikitech.wikimedia.org/wiki/File:EventLogging.png [14:21:13] milimetric: thanks! :) [14:21:42] so burrow's alert were due to my kafka restartst [14:21:49] *restarts [14:22:03] EL handled them correctly with Andrew's patch, we are good :) [14:23:26] Also I added some info in the email template about how to debug the issue, and Nuria added some EL specific notes in https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall#Consumption_Lag_.22Alarms.22:_Burrow [14:24:23] but noww.. I need to restart yarn and datanode on all the analytics* hosts for the last security updates [14:24:31] in batches :D [14:24:45] right, I read that, but the alerts still make no sense to me [14:24:56] I'm trying to read https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules but it's not making sense to my brain [14:25:07] the alert says: [14:25:08] eventlogging-server-side:0 (1455799472454, 315783539, 0) -> (1455800426596, 315809273, 0) [14:25:18] which is supposed to be (timestamp, start-offset, start-lag) -> (timestamp, end-offset, end-lag) [14:25:28] so the start-lag is 0 and the end-lag is 0? [14:25:29] what?! [14:26:27] I think that in this case it was an ERR state, so probably complete failure to read.. 
I believe it was due to the fact that the consumer was trying to read from the broker restarted, that was the partition leader [14:26:39] then a metadata update cleared up the confusion [14:26:57] that makes sense, but how do I get that from the burrow alert [14:28:19] well I usually check Status first, and then the 0 -> 0 values that you pointed out suggested the consumer completely stalled [14:28:52] but probably I am missing your point about the confusion, because usually you explain to me how things work :D [14:29:43] Evalutation rules: 2) If the consumer offset does not change over the window, and the lag is either fixed or increasing, the consumer is in an ERROR state, and the partition is marked as STALLED. [14:31:55] anyone else have an idea with the hive weirdness above? [14:32:34] Hey Ironholds, sorry I left for 1/2 hour [14:32:38] np! [14:32:45] Ironholds: I have read the parquet message before [14:32:48] Ironholds: no, I was gonna wait for Joseph to look at it, but ... | grep -v Feb ? :) [14:32:52] I don't know what they are about though :( [14:33:21] elukey: I think I'm not certain about terminology [14:33:28] A way to get no logs from hive is by starting it with -s (silent) option [14:33:32] Ironholds: --^ [14:33:33] joal, thanks! [14:33:37] so "window" means the gap between the left and right side of that arrow? [14:33:39] I hope it does the trick [14:33:40] I'll experiment with that and see if it removes that class of error :) [14:34:07] I can't explain burrow because I wasn't on call since they installed it, so it's brand new to me [14:34:56] elukey: so if that's the window, the consumer offset *did* change over the window [14:38:07] mmmm true [14:40:24] ok, well, everything seems to be in order, I double checked, so for now the only problem is me not understanding burrow [14:40:40] I'm gonna take a shower and join the trees in -aspen to get some work done [14:43:36] milimetric: could be that it was applied rule 4, I'll double check.. anyhow, it was due to me stopping brokers almost for sure :) [14:43:52] we could change the format of the email to include deltas [14:43:59] in the (,,) -> (,,) info [14:44:13] so one could spot consumer offset delta very easily [14:47:18] joal, do you prefer to stop camus now or just when we restart the master nodes [14:47:21] ? [15:01:52] joal, it doesn't remove the warnings [15:02:00] could it be something to do with the logging warnings on query launch? [15:02:03] like, it's trying to log inline? [15:07:34] Ironholds: hm, weird [15:07:48] elukey: only for master node should be ok [15:08:03] super, starting the node restarts then [15:09:40] Ironholds: Can you give me an example query ? [15:09:54] joal, SELECT uri_host, uri_path, webrequest_source FROM webrequest [15:09:54] WHERE year = 2016 AND month = 01 AND day = 25 [15:09:54] AND uri_host IN('www.wikipedia.org', 'wikipedia.org') AND content_type RLIKE('^text/html') [15:09:54] LIMIT 200; [15:10:24] (CR) Jcrespo: "I hope dbs.json is automatically generated and there is a plan to keep that up to date?" 
[analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [15:10:26] (PS5) Mforns: Improve the format of the browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) [15:10:45] I'm experimenting with upping the threshold for hive messages from INFO to ERROR in case this is the consequence of our failure to institute a logging framework, and hive is just outputting it inline [15:11:03] !log restarting hadoop services on analytics103* hosts for security upgrades [15:12:34] (CR) Jcrespo: "I see it now. I was going in order." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/266925 (https://phabricator.wikimedia.org/T76466) (owner: Alex Monk) [15:17:42] joal: I restarted 10 nodes, all good for the moment. Going to check yarn and hue [15:18:17] coool, elukey, just checking, you are doing 1 at a time, ja? [15:18:27] Ironholds: have you find an answer ? [15:18:54] ottomata: 3 at the time (one daemon for each batch) - too many? [15:19:11] Ironholds: I have a solution, not perfect though [15:19:23] Ironholds: it involves using beeline instead of hive :S [15:19:49] elukey: monitoring camus, no errors so far [15:19:58] ottomata: also today kafka1018 showed the same fstab issue that we saw on kafka1012 :( [15:20:33] :? [15:20:34] ! [15:20:37] wrong entry? [15:20:50] elukey: its probably ok, iusually just do 1 at a time to be nice to jobe [15:20:51] jobs [15:21:06] joal, not yet, and we've run into weird problems with beeline [15:21:06] any halfway complete jobs on the nodes you restart will have to start from the beginning elsewhere [15:21:14] like mikhail tried switching over to it and had to switch back [15:21:26] but I'll let you know if my logging setting tweakery works when the query completes [15:21:35] ottomata: yep sure, I'll reduce the parallelism, thanks for the suggestion [15:22:21] joal: probably not too interseting at this point: https://github.com/andrewstevenson/stream-reactor/tree/master/kafka-connect###Kafka-Connect-Cassandra [15:22:29] but in case we start using kafka connect, here's a cassandra writer [15:22:30] (PS6) Mforns: Improve the format of the browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) [15:22:43] (CR) Mforns: Improve the format of the browser report (4 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [15:22:47] ottomata: good to know :) [15:23:20] (CR) Mforns: "Ready for review." [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [15:23:53] (PS7) Mforns: Improve the format of the browser report [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) [15:23:57] Ironholds: Managed to have results from your query without a single log line using beeline ... [15:25:14] joal, thanks! I will hold on beeline-based solutions until this afternoon when bearloga comes online and we can go over the issues he had with beeline, if that works? [15:25:21] are the log output and result outputs not on stderr and stdout? [15:25:23] repectively? [15:25:55] ottomata, answering a question with a question; in unix if you just do foo > bar without specifying where the errors go will it output stdout and stderr to the same file? 
[15:25:58] because that could be the problem [15:25:59] elukey: are you done with kafka reboots? [15:26:06] Ironholds: no [15:26:09] that should just be stdout [15:26:16] ottomata, then they're the same pipe, yeah [15:26:19] boooo [15:26:21] :/ [15:26:25] the weird partition warnings are coming out in the output file [15:26:36] what are the partition warnings? [15:26:42] the query is structured as a unix call of hive -f query_file > output_file.tsv [15:26:49] "Feb 18, 2016 1:49:48 PM INFO: parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728" and similar [15:26:57] lots of parquet partition and block complaints [15:27:14] huh [15:27:28] that is quite dumb [15:27:55] ottomata, yeah, I'm experimenting with limiting output to just ERROR rather than INFO, in case it's a problem of the defaults being "write everything to the CLI" [15:27:57] Ironholds: stupid work around: hive ... | grep -v 'INFO: ' > output_file.tsv [15:27:57] ? [15:28:07] that'll be my backup, yeah :( [15:28:09] yeah that could maybe work too [15:28:23] (not that it's a bad solution, it's just: whenever you're using unix utils to work around a system someone silly designed that system) [15:28:32] yeah [15:34:58] ottomata: yep kafka reboots completed [15:35:06] I just need to push the change for mediawikiconfig [15:35:41] how did eventlogging behave? [15:35:44] Ironholds: no prob for me :) [15:36:03] joal, how are you running it? [15:36:07] ottomata: super good, some burrow alerts but I suspect due to metadata changes [15:36:47] Ironholds: beeline -u jdbc:hive2://analytics1027:10000 -n --silent=true [15:36:54] *thumbs up* [15:37:06] I will try that if Mikhail clears it and the log tweakery doesn't help [15:37:23] !log restarting hadoop services on analytics102* nodes for security updates [15:37:33] But Ironholds, you also need to specify the format (default format is tabular with | blah | etc) [15:37:38] I was looking into that [15:37:47] *nod* [15:38:42] Ironholds: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients [15:39:17] Ironholds: beeline options: --outputformat=[table/vertical/csv/tsv/dsv/csv2/tsv2] (default table) [15:39:51] thanks! [15:39:54] np :) [15:40:02] Interesting for me as well Ironholds :) [15:40:16] I actually want to switch to beeline :) [15:40:59] ditto! [15:41:05] aaargh [15:41:09] okay, changing the logging doesn't work [15:41:18] I give up on this until bearloga is awake and we can chat the beeline issues he had [15:41:26] sounds good [15:41:43] Ironholds: Depending on he time, I might be away, but please let me know (email or whatever) [15:42:16] shall do! [15:42:43] Analytics-Kanban: Update AQS yaml format to match new convention {melc} - https://phabricator.wikimedia.org/T127323#2039784 (Milimetric) [15:44:11] !log restarting hadoop services on analytics104* nodes for security updates [15:46:15] (PS1) Milimetric: Use {{...}} in configs [analytics/query-service] - https://gerrit.wikimedia.org/r/271535 (https://phabricator.wikimedia.org/T127323) [15:49:33] joal: yt? 
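[Editor's note] Two workarounds for the parquet INFO lines polluting Ironholds' TSV output come up in the exchange above: stripping them out of the hive CLI's output, or going through HiveServer2 with beeline in silent mode and a tab-separated output format. A rough sketch of both, using the JDBC URL joal pasted; the query file name is a placeholder, and passing "$USER" to -n is an assumption (the chat shows -n without a visible argument):

    # Workaround 1: keep the hive CLI and filter the inline INFO lines
    # (crude, but fine as long as real results never contain "INFO: ").
    hive -f query_file.hql | grep -v 'INFO: ' > output_file.tsv

    # Workaround 2: use beeline against HiveServer2, silencing logs and
    # asking for tab-separated output directly (tsv2 avoids the quoting
    # of the older tsv format).
    beeline -u jdbc:hive2://analytics1027:10000 -n "$USER" \
            --silent=true --outputformat=tsv2 \
            -f query_file.hql > output_file.tsv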
[15:52:25] Analytics-Kanban: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2039833 (mforns) [15:52:47] Analytics: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2039845 (mforns) [16:06:22] (PS1) Milimetric: Fix handling of encoded and spaced article titles [analytics/query-service] - https://gerrit.wikimedia.org/r/271540 (https://phabricator.wikimedia.org/T126669) [16:06:40] nuria: I am [16:06:45] nuria: wasasup? [16:07:11] Analytics, Analytics-Kanban, Pageviews-API: Caching on pageview API should be for 1 day [1 pts] - https://phabricator.wikimedia.org/T127214#2035871 (Milimetric) a:Milimetric [16:09:56] joal: in 1 on 1, will talk in a bit [16:10:09] sure nuria [16:11:53] (PS1) Milimetric: Increase caching to 1 day [analytics/query-service] - https://gerrit.wikimedia.org/r/271542 (https://phabricator.wikimedia.org/T127214) [16:12:40] Analytics-Kanban: Create reportupdater browser reports that query hive's browser_general table {lama} - https://phabricator.wikimedia.org/T127326#2039921 (mforns) [16:13:39] Analytics-Kanban: Puppetize reportupdater to be executed in stat1002 and run the browser reports {lama} - https://phabricator.wikimedia.org/T127327#2039936 (mforns) [16:16:03] Analytics-Cluster, Operations, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2039959 (mark) Old RESTbase machines indeed seem good for these, but won't be available for use for realistically another 2 weeks at least. [16:16:42] Analytics-Kanban: Puppetize reportupdater to be executed in stat1002 and run the browser reports {lama} - https://phabricator.wikimedia.org/T127327#2039960 (mforns) [16:19:44] Analytics-Kanban, Operations, Ops-Access-Requests, Patch-For-Review: All members of analytics team need to have sudo -u hdfs on cluster {hawk} [2 pts] - https://phabricator.wikimedia.org/T126752#2039966 (fgiunchedi) Open>Resolved merged, access will be granted at the next puppet run, please re... [16:26:04] ottomata: is that you having restarting load process for hours 13 ? [16:26:17] eh? [16:26:21] no? [16:26:23] not me [16:26:39] ok :) [16:27:11] joal: did you re-started the last access monthly? 0025050-160202151345641-oozie-oozi-W last_access_uniques_monthly-2016-1-wfRUNNING hdfs - 2016-02-18 16:19 GMT - [16:27:44] nuria: I have spent my day investigating why it failed, and I think I have found it (testing currently, should know before standup) [16:27:50] joal: k [16:28:30] * joal is dancing in joy in his small office, while crying of sadness at the same time [16:29:09] !log restarting eventlogging with pykafka 2.2.0 [16:30:06] ottomata: By the way, your patch worked very nicely this afternoon, I have monitored a python executor, and it was really nice to see (very verbose, but nice :) [16:30:43] great! [16:31:19] did you manage to create then new deb with the updated gevent? [16:31:23] madhuvishy: standddupppp [16:31:24] or just the patch? 
[16:31:39] cc elukey standdduppp [16:31:47] (PS1) Joal: Correct last_access_uniques oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/271550 [16:32:00] (PS3) Joal: Add monthly top to cassandra and correct jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/270921 (https://phabricator.wikimedia.org/T120113) [16:32:25] a-team internet issues trying to join [16:33:37] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Refactor analytics/cdh roles to use hiera, setup Analytics Cluster in beta labs. [21 pts] - https://phabricator.wikimedia.org/T109859#2040030 (Nuria) Open>Resolved [16:33:41] Analytics-Kanban, Patch-For-Review: camus-wediawiki job should run in production (or essential?) queue {hawk} [1 pts] - https://phabricator.wikimedia.org/T125967#2040045 (Nuria) Open>Resolved [16:37:39] madhuvishy: hangout not working so well, might be just that [16:44:20] (CR) Yurik: [C: 1] "thanks!!!" [analytics/query-service] - https://gerrit.wikimedia.org/r/271540 (https://phabricator.wikimedia.org/T126669) (owner: Milimetric) [16:54:15] !log restarting hadoop services on analytics105* nodes for security updates [16:54:32] ottomata: can you confim me the spark version we'll use on the new CDH ?version [16:58:24] 1.5..... [16:58:44] 1.5.0 [16:59:35] ottomata: mind to keep me in the loop for CDH upgrade? [16:59:36] :) [17:00:04] certainly! [17:00:12] elukey: https://etherpad.wikimedia.org/p/analytics-cdh5.5 [17:02:07] \o/ [17:02:26] ottomata: I am good for analytics1001/1002 [17:02:43] I am going to stop camus on 1027 though before starting [17:03:34] then stop yarn on 1001, manual failover hdfs to 1002, restart 1001, manual failover from 1002 to 1001, restart 1002 [17:05:34] ok [17:05:39] nuria: got kicked out again - i'm gonna go get breakfast [17:05:46] madhuvishy: k [17:05:47] sound good elukey! [17:12:12] (CR) Nuria: Correct last_access_uniques oozie jobs (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271550 (owner: Joal) [17:14:36] ottomata: no API breakage for us (at least at compilation time) from moving to spark 1.5 [17:15:17] (PS2) Joal: Correct last_access_uniques oozie jobs [analytics/refinery] - https://gerrit.wikimedia.org/r/271550 [17:15:36] cool [17:15:38] thanks joal [17:16:05] (CR) Joal: Correct last_access_uniques oozie jobs (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271550 (owner: Joal) [17:16:18] (CR) Nuria: "Looked at patch #5 comments, looking now at latest patch." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [17:16:59] nuria, the changes between 5 and latest are minimal, just a rebase and typo in documentation [17:17:08] just fyi [17:17:31] joal: stopping camus on 1027 ok? [17:17:38] please elukey ! [17:18:10] Analytics-Cluster, Operations, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040281 (RobH) @Ottomata: Can you advise if waiting for these is acceptable? The new restbase machines arrive next week, and we'll likely need to g... [17:18:14] mforns: k [17:18:15] elukey: actually, stop puppet, remove from cron, wait for the current job to finish [17:18:24] :) [17:19:48] !log disabled puppet and camus on analytics1027 [17:20:23] Analytics-Cluster, Operations, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040286 (Ottomata) @RobH, we'll wait. 
I might move these services in the mean time to a spare old Analytics Dell. We won't be in such a time crunc... [17:20:50] lunch, back in a bit [17:20:51] joal: just checking the process "camus" on ps? [17:21:03] or is there another more intelligent thing to do? [17:21:38] Analytics-Cluster, Operations, hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2040290 (RobH) Excellent, so with this info we can let @mark review and approve (since his only listed question was the waiting period until these a... [17:28:54] (CR) Nuria: Improve the format of the browser report (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [17:29:38] (CR) Nuria: "I think SQL is pretty clear, if you have tested this on cluster I think this is ready to merge. I am for removing older reports." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [17:31:43] (CR) Mforns: [C: -1] "OK, thanks Nuria!" [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [17:32:27] (CR) Nuria: [C: 2 V: 2] "Merging." [analytics/refinery] - https://gerrit.wikimedia.org/r/271550 (owner: Joal) [17:32:53] joal: retro [17:32:57] joal: on batcave? [17:47:26] !log manual failover of hadoop master node (analytics1001) to secondary (analytics1002) for maintenance (plus service restarts) [17:54:53] Excuse me a-team for the sudden leave and retro missing :( [18:02:11] joal: notes on etherpad: https://etherpad.wikimedia.org/p/analytics-retrospective [18:02:34] a-team: joining for grooming though [18:06:25] Analytics, Analytics-Kanban, Pageviews-API, Patch-For-Review: Caching on pageview API should be for 1 day [1 pts] - https://phabricator.wikimedia.org/T127214#2040519 (Milimetric) [18:06:37] Analytics, Analytics-Kanban, Pageviews-API, Patch-For-Review: Pageviews's title parameter is impossible to generate from wiki markup {melc} - https://phabricator.wikimedia.org/T127034#2040521 (Milimetric) [18:06:46] (CR) Ppchelko: [C: 2 V: 2] "LGTM" [analytics/query-service] - https://gerrit.wikimedia.org/r/271535 (https://phabricator.wikimedia.org/T127323) (owner: Milimetric) [18:07:19] Analytics, Analytics-Kanban, Pageviews-API, Patch-For-Review: Caching on pageview API should be for 1 day [1 pts] - https://phabricator.wikimedia.org/T127214#2040528 (Milimetric) [18:07:29] Analytics, Analytics-Kanban, Pageviews-API, Patch-For-Review: Pageviews's title parameter is impossible to generate from wiki markup {melc} - https://phabricator.wikimedia.org/T127034#2040531 (Milimetric) [18:07:43] Analytics-Kanban, Patch-For-Review: Caching on pageview API should be for 1 day [1 pts] - https://phabricator.wikimedia.org/T127214#2040533 (Milimetric) [18:07:50] Analytics-Kanban, Patch-For-Review: Pageviews's title parameter is impossible to generate from wiki markup {melc} - https://phabricator.wikimedia.org/T127034#2040534 (Milimetric) [18:08:21] elukey: how is upgrade doing ? [18:09:13] joal: analytics1002 is the new master for yarn and hdfs but 1001 is now in safe mode, I need to wait a bit before being able to transition it to Active [18:09:24] ok cool :) [18:09:39] thanks for headup elukey :) [18:09:48] Failover from analytics1002-eqiad-wmnet to analytics1001-eqiad-wmnet successful [18:09:51] ta daaaa [18:10:52] Yay ! 
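[Editor's note] The master swap elukey just completed follows the plan laid out earlier: stop YARN on analytics1001, fail HDFS over to analytics1002, restart the daemons on 1001, then fail back. A minimal sketch of that sequence, run on the master itself; the service IDs follow the <host>-eqiad-wmnet convention seen in the state checks pasted a little further down, while the init script names and the YARN behaviour on stop are assumptions rather than a transcript of what was actually typed:

    # Hypothetical sketch of the manual master swap, run on analytics1001.

    # Stop the active resourcemanager so the standby on analytics1002 takes
    # over (init script name assumed), then hand off the HDFS namenode role.
    sudo service hadoop-yarn-resourcemanager stop
    sudo -u hdfs /usr/bin/hdfs haadmin -failover \
        analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet

    # Restart the upgraded daemons on analytics1001.
    sudo service hadoop-hdfs-namenode restart
    sudo service hadoop-yarn-resourcemanager start

    # Wait for the restarted namenode to leave safe mode, fail HDFS back,
    # and verify both roles, as in the checks pasted below. YARN may need a
    # manual transition back; check its state first.
    sudo -u hdfs /usr/bin/hdfs dfsadmin -safemode wait
    sudo -u hdfs /usr/bin/hdfs haadmin -failover \
        analytics1002-eqiad-wmnet analytics1001-eqiad-wmnet
    sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
    sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState analytics1001-eqiad-wmnet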
[18:11:44] Analytics-Cluster, EventBus, Operations, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2040574 (RobH) [18:11:46] elukey: Seems everything went fine ! [18:12:58] all right yarn is primary on 1001 too [18:13:10] re-enabling camus [18:13:28] elukey: nodename done as well? [18:13:43] yep [18:13:54] Great [18:14:21] elukey@analytics1001:~$ sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState analytics1001-eqiad-wmnet [18:14:24] active [18:14:37] elukey@analytics1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1001-eqiad-wmnet [18:14:40] active [18:14:55] Analytics, Pageviews-API: Accept a www. prefix for mediawiki and wikidata - https://phabricator.wikimedia.org/T127030#2030197 (Milimetric) [18:14:58] * elukey dances [18:16:03] elukey: I monitor the next run ! Good job mate :) [18:17:20] Analytics, Pageviews-API: Wikimedia pageviews API blocked by ad blockers - https://phabricator.wikimedia.org/T126947#2027714 (Milimetric) We'll look into this but we don't really control adblock software [18:17:24] !log re-enabled puppet and camus on analytics1027 [18:17:29] Analytics, Pageviews-API, Services: Better error messages on pageview APi - https://phabricator.wikimedia.org/T126929#2040612 (Milimetric) [18:17:44] Analytics, Dumps-Generation: Avoid duplication of effort when processing dumps somehow - https://phabricator.wikimedia.org/T126809#2024007 (Milimetric) p:High>Normal [18:17:49] Analytics, Dumps-Generation: Provide a way to check if a dump has been generated - https://phabricator.wikimedia.org/T126808#2023999 (Milimetric) p:High>Normal [18:19:29] BTW [18:19:47] so awesome that there are people that can manage kafka/hadoop restarts other than me [18:19:49] yayyYyYyYyyY [18:19:53] thank you elukey :) :) :) [18:21:29] ottomata: well thank you for the patience :) [18:21:39] ottomata, elukey: jmxtrans is not listed at https://wikitech.wikimedia.org/wiki/Service_restarts#Hadoop_workers , it should probably be restarted along, right? [18:22:13] 1001's hdfs wasn't collaborating to become the master back so I stopped/started the process, waited for the safe mode to finish and transitioned it to active [18:22:19] hope that it was ok [18:22:34] Analytics: Fix layout of the daily email that sends pageview dataset status - https://phabricator.wikimedia.org/T116578#2040687 (Milimetric) [18:22:47] Analytics, Pageviews-API, Services: Better error messages on pageview API - https://phabricator.wikimedia.org/T126929#2040688 (Nuria) [18:23:20] elukey: that sounds fine [18:23:25] moritzm: hm, if it needs to [18:23:42] it usually doesn't matter, and puppet will restart it on a config change [18:23:44] so i usually don't htink about it [18:23:50] but if you are trying to restart all JVMs [18:23:51] than yeah [18:23:54] then* [18:24:07] Analytics-Kanban, Patch-For-Review: Update AQS yaml format to match new convention {melc} - https://phabricator.wikimedia.org/T127323#2040716 (Pchelolo) @Milimetric We're trying to follow the server semantics with #hyperswitch and all the dependencies, so that braking changes shouldn't be automatically pic... [18:24:37] Analytics, Analytics-Cluster: Capacity planning for cluster - https://phabricator.wikimedia.org/T116661#2040724 (Milimetric) Open>Resolved a:Milimetric [18:24:42] ottomata: it's local to each service, so it's just a matter of a single salt run, no need to have one at a time, right? 
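[Editor's note] For reference, the camus pause around the master restart ("stop puppet, remove from cron, wait for the current job to finish", then re-enable afterwards, as logged above for analytics1027) could look roughly like this. The puppet agent commands are standard; the crontab user, the sed pattern and the ps/pgrep check are assumptions based on the conversation, not what elukey actually ran:

    # Hypothetical sketch of pausing and resuming camus on analytics1027.

    # Keep puppet from re-adding the cron entry while we work.
    sudo puppet agent --disable "pausing camus for hadoop master restart"

    # Comment out the camus cron entry (crontab user and pattern assumed).
    sudo crontab -u hdfs -l | sed 's/^\(.*camus.*\)$/#\1/' | sudo crontab -u hdfs -

    # Wait for any in-flight camus run to finish before touching the masters.
    while pgrep -f camus > /dev/null; do
        echo "camus still running, waiting..."; sleep 60
    done

    # Afterwards, re-enable puppet; its next run restores the cron entry.
    sudo puppet agent --enable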
[18:25:27] that's right moritzm [18:25:27] Analytics: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2039833 (Milimetric) needs a user-agent upgrade, can check on the UA Parser github page [18:25:36] they are stateless, you can restart them at any time [18:25:45] Analytics: Fix layout of the daily email that sends pageview dataset status - https://phabricator.wikimedia.org/T116578#2040740 (Milimetric) p:Triage>Normal [18:25:52] Analytics: Parse User-Agent strings with OS like "Windows 7" correctly into the user agent map {hawk} - https://phabricator.wikimedia.org/T127324#2039833 (Milimetric) p:Triage>Normal [18:26:37] Analytics, Pageviews-API: Accept a www. prefix for mediawiki and wikidata - https://phabricator.wikimedia.org/T127030#2040752 (Milimetric) p:Triage>Normal [18:27:04] Analytics: Use spares to test Druid in production - https://phabricator.wikimedia.org/T116930#2040754 (Milimetric) Open>declined a:Milimetric [18:27:22] Analytics, Pageviews-API: Accept a www. prefix for mediawiki and wikidata - https://phabricator.wikimedia.org/T127030#2040770 (Yurik) @Milimetric - I think this is not exactly the "www" prefix, but rather always allowing canonical domains in addition to whatever shorter versions we have. When used from... [18:27:25] Analytics: Data Store hardware procurement - 2015 - https://phabricator.wikimedia.org/T117008#2040771 (Milimetric) Open>Resolved a:Milimetric [18:28:19] ottomata: ok, will do that [18:28:45] Analytics: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#2040779 (Milimetric) p:Triage>Normal [18:28:54] Analytics, Datasets-General-or-Unknown: Inspect Pageview API queries (after launch ) {slug} - https://phabricator.wikimedia.org/T117242#2040785 (Milimetric) p:Triage>Normal [18:29:14] Analytics: Add partitions to webrequest text and upload topics - https://phabricator.wikimedia.org/T127351#2040795 (JAllemandou) [18:29:21] elukey: I had a look at the processes on 1028, 1035 and 1052, the hdfs around is from the hadoop-hdfs-journalnode service (only present on these three hosts) [18:29:43] (I initially suspected it was a case of a failed restart via stale PID or so) [18:29:46] ahhhhh [18:29:53] I forgot those! [18:29:55] you're right! [18:30:00] restarting them one at the time [18:30:23] let's doublecheck with ottomata, but should be fine from what I can tell [18:30:25] Analytics: Add partitions to webrequest text and upload topics - https://phabricator.wikimedia.org/T127351#2040795 (Milimetric) p:Triage>Normal [18:30:29] Analytics: Add partitions to webrequest text and upload topics - https://phabricator.wikimedia.org/T127351#2040795 (Milimetric) p:Normal>High [18:30:44] Analytics: Invalid page titles are appearing in the top_articles data - https://phabricator.wikimedia.org/T117346#2040819 (Milimetric) p:Triage>Normal [18:30:51] moritzm: yep they are quorum based, so one at the time is fine [18:31:27] Analytics-Cluster, EventBus, Operations, Services: Investigate proper set up for using Kafka MirrorMaker with new main Kafka clusters. - https://phabricator.wikimedia.org/T123954#2040828 (RobH) [18:31:50] ok, could you please also add this to the "Service restarts" page? [18:32:00] Analytics: Create API functionality to see the date ranges available - https://phabricator.wikimedia.org/T117361#2040833 (Milimetric) Open>declined a:Milimetric Hard-ish to do performantly, and the dates will be fixed once we finish back-filling. 
More or less May 2015 to Now - 1 day [18:32:22] Analytics: Make reportupdater output emtpy values when query returns no results. - https://phabricator.wikimedia.org/T117537#1776969 (Milimetric) p:Triage>High [18:32:43] moritzm: sure I was doing it :) [18:33:06] Analytics: Community has a Stats landing page with links - https://phabricator.wikimedia.org/T117496#2040844 (Milimetric) Open>declined a:Milimetric Wikistats 2.0 should be such a resource [18:33:25] Analytics: Implement re-run script for reportupdater - https://phabricator.wikimedia.org/T117538#2040848 (Milimetric) p:Triage>Normal [18:33:43] ja one at a time is fine [18:33:47] Analytics, Easy: Implement re-run script for reportupdater - https://phabricator.wikimedia.org/T117538#1776983 (Milimetric) [18:33:59] Analytics-Kanban, Reading-Admin, Patch-For-Review: Tabular layout on dashiki [8 pts] {lama} - https://phabricator.wikimedia.org/T118329#2040853 (Milimetric) [18:34:01] Analytics: Run browser reports on hive monthly - https://phabricator.wikimedia.org/T118330#2040851 (Milimetric) Open>declined a:Milimetric [18:34:39] Analytics: Create depot to have different API clients for pageview API - https://phabricator.wikimedia.org/T118190#2040858 (Milimetric) Open>declined [18:35:18] Analytics, Analytics-Cluster, Operations: Audit Hadoop worker memory usage. - https://phabricator.wikimedia.org/T118501#1802076 (Milimetric) p:Triage>Normal [18:35:44] Analytics, Pageviews-API: AQS: query multiple articles at the same time - https://phabricator.wikimedia.org/T118508#1802156 (Milimetric) p:Triage>Normal [18:38:59] moritzm: done! +doc [18:40:11] Analytics-EventLogging, MobileFrontend, Easy, Technical-Debt, Upstream: Should be possible to override sampling in EventLogging schemas for development purpose - https://phabricator.wikimedia.org/T125122#2040889 (Jdlrobson) [18:40:15] (CR) Joal: "Added a comment about data localisation." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [18:47:09] Analytics, Pageviews-API, Services: Better error messages on pageview API - https://phabricator.wikimedia.org/T126929#2027297 (Pchelolo) @Nuria For parameter validation we rely on the [[ https://www.npmjs.com/package/ajv | ajv ]] JSON-schema validation library, and we're sending out the default error... [18:47:21] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Upgrade analytics and beta project Analytics Clusters to CDH 5.5 [8 pts] - https://phabricator.wikimedia.org/T127115#2040976 (Ottomata) Beta done. [18:47:24] Analytics, HyperSwitch, Pageviews-API, Services: Better error messages on pageview API - https://phabricator.wikimedia.org/T126929#2040977 (Pchelolo) a:Pchelolo [18:47:31] (CR) Mforns: Improve the format of the browser report (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/271253 (https://phabricator.wikimedia.org/T126282) (owner: Mforns) [18:49:56] ottomata: quick question about kafka100[12] - any special precautions to reboot them? [18:50:05] since they are EB ones [18:52:31] Analytics, HyperSwitch, Pageviews-API, Services: Better error messages on pageview API - https://phabricator.wikimedia.org/T126929#2040988 (Nuria) >The reason why they don't include allowed values in the message is that enum could be a set of objects or super-long strings, so the message could bec... 
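[Editor's note] Going back to the journalnode restarts moritzm and elukey sorted out above (hadoop-hdfs-journalnode runs only on analytics1028, 1035 and 1052, and being quorum-based they must be restarted strictly one at a time): a rough sketch of that loop, assuming plain ssh rather than the salt run moritzm used, and an arbitrary settle time between hosts:

    # Hypothetical sketch: restart the three HDFS journalnodes one at a time,
    # giving each a moment to rejoin the quorum before touching the next.
    for host in analytics1028 analytics1035 analytics1052; do
        echo "Restarting hadoop-hdfs-journalnode on ${host}..."
        ssh "${host}.eqiad.wmnet" sudo service hadoop-hdfs-journalnode restart
        sleep 120   # arbitrary pause; check logs before moving on
    done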
[18:53:29] elukey: no they shoudl be just like the others [18:53:39] make sure you do elections between them [18:54:03] most topics there only have one partition [18:54:06] with 2 replicas [18:54:46] very simple then [18:54:52] I'll do them tomorrow :) [18:57:59] (CR) Nuria: [C: 2 V: 2] "Merging. Per our IRC conversation I assume you have tested on cluster." (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/270921 (https://phabricator.wikimedia.org/T120113) (owner: Joal) [19:00:38] joal: this is kinda nice: https://issues.apache.org/jira/browse/OOZIE-2187 [19:01:01] can reduce a little bit of boiler plate from oozie properties and xml [19:01:24] For sure ottomata ! [19:01:25] oo also https://issues.apache.org/jira/browse/OOZIE-2160 [19:01:50] Yay ! [19:02:29] oh joal IInteresting. [19:02:30] • Dynamic allocation is enabled by default. You can explicitly disable dynamic allocation by using the optionspark.dynamicAllocation.enabled = false. Dynamic allocation is implicitly disabled if --num-executors is specified in the job. [19:02:39] a-team: logging off, talk with you tomorrow! [19:02:43] byyeee [19:02:43] nite elukey [19:02:44] bye elukey ! [19:02:52] bye elukey ! [19:03:38] ottomata: That's great, we probably want to test the dynamic allocation before removing the preset num-executors though :) [19:04:14] milimetric: agreed that burrow alarms with "0" at the end for lag do not make much sense ... mmmm [19:04:19] a-team, gone for diner, will deploy refinery and restart monthly fialing jobs tomorrow morning ! [19:04:33] ok joal see you! [19:06:18] joal: indeed [19:06:21] laters! [20:39:43] ellery: oliver mentioned that you did some work on the implications the HTTPS switchover had for referer counting... [20:40:21] ..did you get a sense on whether/how much it might have impacted the overall ratio of pageviews with google referer? (i.e. the first chart at http://ironholds.org/blog/is-google-tanking-wikipedias-traffic/ )
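[Editor's note] Circling back to the kafka1001/1002 reboots elukey queues up for tomorrow: ottomata's advice ("just like the others, make sure you do elections between them") matches the broker-by-broker procedure used on the analytics Kafka cluster throughout this log. A minimal sketch of one round, assuming the stock kafka-preferred-replica-election.sh tool and a placeholder ZooKeeper connection string; on these hosts the same step may be wrapped in a local kafka helper script, as the bare "kafka preferred-replica-election" log entries above suggest:

    # Hypothetical sketch of one broker round: reboot, wait, re-elect leaders.
    BROKER=kafka1001.eqiad.wmnet
    ZOOKEEPER=conf1001.eqiad.wmnet:2181/kafka   # placeholder connection string

    ssh "$BROKER" sudo reboot

    # Wait for the host to come back and the Kafka broker process to be up.
    until ssh "$BROKER" pgrep -f kafka.Kafka > /dev/null 2>&1; do
        sleep 30
    done

    # Move partition leadership back to the preferred replicas, so the
    # rebooted broker resumes leading its share of the (mostly
    # single-partition, two-replica) topics mentioned above.
    kafka-preferred-replica-election.sh --zookeeper "$ZOOKEEPER"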