[00:00:57] (repo, IDEA, plugins are all up-to-date)
[01:35:29] Analytics, Analytics-EventLogging, Performance-Team, Regression: Beta Cluster: Unable to obtain lock via objectcache (memcached) - https://phabricator.wikimedia.org/T198280#4320871 (Krinkle)
[01:35:53] Analytics, Analytics-EventLogging, Performance-Team, Regression: Beta Cluster: Unable to obtain lock via objectcache (memcached add() fails) - https://phabricator.wikimedia.org/T198280#4318032 (Krinkle)
[01:49:50] Hi - as part of T195314, I've been looking at various services in Beta Cluster to verify they are working well to support the navtiming pipeline. It seems that the kafka-jumbo brokers are currently not working. Both via python/rdkafka and with kafkacat, both jumbo-1 and -2 are refusing connections.
[01:49:51] T195314: Set up webperf node in Beta Cluster - https://phabricator.wikimedia.org/T195314
[01:50:19] The syslog of deployment-kafka-jumbo-2 says "Connection to node -2/-1 could not be established. Broker may not be available"
[01:50:49] Looking at https://tools.wmflabs.org/nagf/?project=deployment-prep#h_deployment-kafka-jumbo-1_disk-bytes
[01:50:53] It seems free disk space is 0
[02:37:20] Krinkle: are you by any chance referring to deployment-eventlog05.eqiad.wmflabs? If so, that explains why I wasn't seeing any events in any of the EventLogging logs (locations at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster#How_to_verify_events). I thought maybe something was wrong on my end and events weren't getting sent.
[02:38:18] bearloga: I don't know of anything wrong with eventlog05; however, as I understand it, EventLogging both consumes from and produces to Kafka (it consumes from the raw varnishkafka topic and produces to the eventlogging_ topics).
[02:38:33] So with the Kafka broker hosts down/broken, I would imagine eventlog05 is doing nothing.
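Krinkle checked the brokers with python/rdkafka and kafkacat. A cruder first check is whether anything accepts a TCP connection on the broker port at all. This is a minimal Python sketch, not the tooling used in the log. It assumes port 9092 (Kafka's usual plaintext default) and the bare host names shown, and a successful connect says nothing about whether the broker can actually serve requests.

```python
import socket
from typing import Dict, Iterable


def broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if something accepts a TCP connection on host:port.

    This does not speak the Kafka protocol: True only means the port is open,
    not that the broker is healthy or has free disk.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeouts and DNS failures are all OSError subclasses.
        return False


def check_all(hosts: Iterable[str], port: int = 9092) -> Dict[str, bool]:
    """Check each host on the same port (9092 is an assumed default)."""
    return {host: broker_reachable(host, port) for host in hosts}


# Example (host names taken from the log, port assumed):
# check_all(["deployment-kafka-jumbo-1", "deployment-kafka-jumbo-2"])
```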
[02:38:45] I'm experiencing the same with webperf11 (processes navtiming beacons)
[02:38:50] nuria_ elukey: ^
[04:01:36] Krinkle: yeah, this is a known issue with beta cluster, machines are small and disks fill up, they normally need a reboot a removal of data
[04:22:41] *and removal of data
[04:36:59] helloooo a-team, starting super early today to leave in the afternoon for Madrid
[04:48:01] wow fdans OVERLAP!
[04:48:23] nuria_: hell yeaaa HELLO FROM TOMORROW
[04:48:25] Krinkle: I am not in the kafka labs project, will wait for the EU team to reboot
[04:48:37] Krinkle: and will get permissions so I can help next time
[04:48:54] fdans: jajaja
[05:08:56] (CR) Nuria: [C: 2] Add oozie jobs loading druid daily uniques monthly (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/436826 (https://phabricator.wikimedia.org/T197885) (owner: Joal)
[05:34:59] morning!
[06:10:11] !log move /srv/kafka to a dedicated 60G partition on deployment-jumbo hosts in deployment-prep
[06:10:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:13:36] ok so kafka in deployment prep + eventlogging should be fine now
[06:42:15] in theory we have retention policies for kafka logs that should avoid filling up the disk
[06:43:02] but up to this morning we were running 20G partitions for kafka, that may be too small to see deletion kicking it
[06:43:05] *in
[06:44:30] maximum size for a log segment is log.segment.bytes=134217728 (128 MiB, ~134 MB)
[06:44:55] log.retention.hours=168 -> (1w)
[06:45:16] and also
[06:45:17] # A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
[06:45:20] # segments don't drop below log.retention.bytes. Functions independently of log.retention.hours.
[06:45:23] log.retention.bytes=134217728
[06:48:49] so let's drop the retention to, say, 3 days
[06:48:53] in labs it should be fine
[06:49:22] even though I'd wait a few days to see if the cleanup policies stabilize with 60G
[06:49:47] I'll send an email to Krinkle/bearloga/nuria --^
[06:54:01] Hi folks :)
[06:58:42] (CR) Joal: Add oozie jobs loading druid daily uniques monthly (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/436826 (https://phabricator.wikimedia.org/T197885) (owner: Joal)
[07:02:43] hello joal :)
[07:03:16] elukey: I'm preparing a small puppet patch for the Sqoop commands change, then we could do a cluster deploy?
[07:03:22] +1
[07:03:54] joal: I was wondering if it was worth taking a course in Spark, do you happen to know any experts in the field that can help? :P
[07:04:24] elukey: I don't - But I may have some hints here and there if you want
[07:04:57] you don't? I am reading a link and I am pretty sure that you know one :)
[07:05:09] elukey: I feel so ashamed of that email nuria_ sent !! I told you lot, just did not advertise :)
[07:05:27] :D :D :D
[07:05:40] congrats man, really great to see a colleague invited to a conf :)
[07:05:58] why ashamed!!!
[07:06:01] elukey: was very interesting to look at our stack from a different usage perspective
[07:06:18] the part I'm ashamed of is "without telling us" :)
[07:06:38] ah yes this is definitely true then! :P :P
[07:09:10] :)
[07:11:13] fdans: I'm going to cluster-deploy today - Shall we go and merge the interval thing?
[07:11:29] joal: yessss
[07:11:32] okey :)
[07:12:01] fdans: anything else?
[07:12:08] joal: the aqs thing goes first right?
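The retention settings quoted above interact with the partition size. As a rough sketch, assume that log.retention.bytes applies per partition (as it does in Kafka), that retention only deletes whole closed segments, and that "20G"/"60G" mean GiB. Then each partition replica can hold about log.retention.bytes plus one more segment before cleanup frees space. The helper names are illustrative only, and the bound ignores index files and growth between retention checks.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

SEGMENT_BYTES = 134_217_728    # log.segment.bytes quoted in the log (128 MiB)
RETENTION_BYTES = 134_217_728  # log.retention.bytes quoted in the log, per partition


def worst_case_bytes(partitions: int,
                     retention_bytes: int = RETENTION_BYTES,
                     segment_bytes: int = SEGMENT_BYTES) -> int:
    """Rough upper bound on disk used by `partitions` partition replicas on one broker.

    Size-based retention only removes whole, closed segments, so a partition can
    hold retention_bytes plus up to one extra segment before the oldest one goes.
    """
    return partitions * (retention_bytes + segment_bytes)


def max_partitions(disk_bytes: int, **kwargs: int) -> int:
    """How many partition replicas fit on `disk_bytes` under that rough bound."""
    return disk_bytes // worst_case_bytes(1, **kwargs)
```

With the quoted values this gives roughly 80 partition replicas on 20 GiB versus 240 on 60 GiB. The actual partition count on the beta brokers is not given in the log, so this only shows how quickly a small partition can fill even with size-based retention in place.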
[07:12:14] fdans: It should yes
[07:12:26] fdans: in terms of feature release I mean
[07:12:36] fdans: We can deploy the loading change but not restart the job
[07:12:52] ok then don't merge the stuff in refinery yet, I want to triple check the oozie job
[07:13:04] fdans: works for me
[07:13:27] has anybody checked the last pageview whitelist emails?
[07:13:29] (CR) Fdans: [V: 2 C: 2] Add glue code to turn "ceiled" pageview values into intervals (2 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/440136 (https://phabricator.wikimedia.org/T188928) (owner: Fdans)
[07:13:44] would be great to add new values to the whitelist before deploying in case
[07:13:49] elukey: I have not - I'll do
[07:14:01] I can do it while you prep the deploy joal
[07:14:07] ok elukey
[07:14:18] sorry elukey I totally overlooked the email at 9pm
[07:14:19] elukey: currently triple checking the list of stuff to deploy :)
[07:15:11] (CR) Joal: [V: 2 C: 2] "Merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/436826 (https://phabricator.wikimedia.org/T197885) (owner: Joal)
[07:16:14] fdans: np! It is a minor thing, I was just checking as well :)
[07:16:46] project wikimedia 1 2018 6 27 18
[07:17:06] WAT?
[07:17:13] yeah it is a bit weird
[07:17:26] it happened two times
[07:17:37] Let's look for those lines
[07:21:34] "And you can configure Matomo to connect to your database over SSL. "
[07:21:37] https://matomo.org/changelog/matomo-3-5-0/
[07:21:39] yesssss
[07:21:44] this is great
[07:22:06] joal: should I oozie load a couple months in a test cassandra keyspace in production?
[07:26:43] fdans: That would be great, yes - let's say 2 months for instance?
[07:29:04] joal: cool, I need you or elukey to create local_group_default_T_top_bycountry_test since I don't have prod cassandra access
[07:29:27] if you're not dealing with something urgent, that is :)
[07:30:08] fdans: will do, except for the name: will put test at the beginning if you agree :)
[07:30:22] joal: sounds good to me
[07:31:54] fdans: I'll create this new keyspace with the exact same settings as local_group_default_T_top_bycountry, right?
[07:32:09] joal: that's right
[07:32:21] only the content of the countries json changes
[07:32:27] fdans: done
[07:32:35] test_local_group_default_T_top_bycountry
[07:32:38] fdans: --^
[07:32:43] this is the exact keyspace name
[07:32:50] thank youuuu so much joal
[07:33:10] changing name in job in my home dir
[07:33:12] fdans: When you have executed some loading, let me know, we can query cassandra to check
[07:33:23] gotcha
[07:33:52] fdans: Shall I deploy AQS now (while we prep cluster)?
[07:33:53] joal: analytics1003 updated
[07:33:58] \o/
[07:34:00] Thanks elukey :)
[07:34:10] elukey: now I REALLY need to deploy the cluster :)
[07:34:42] joal: that would be awesome, but it's my ops week! you really don't have to :)
[07:35:15] fdans: no worries, deploy day :)
[07:35:30] all yours then joal
[07:41:31] !log testing load of 2 months of per country pageviews with the new ceiled value
[07:41:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:52:30] (PS1) Elukey: Add 'wikimedia' to the pageview whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/442787
[07:52:45] there --^
[08:01:11] (CR) Joal: [V: 2 C: 2] "Good for me :) Merging" [analytics/refinery] - https://gerrit.wikimedia.org/r/442787 (owner: Elukey)
[08:08:30] (CR) Sahil505: "@Nuria : Just to confirm naming should be "legacy pageviews" OR "legacy pagecounts"?" (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/442257 (https://phabricator.wikimedia.org/T189619) (owner: Sahil505)
[08:08:51] joal: feel free to check in cassandra
[08:12:27] fdans: doing
[08:13:28] fdans: looks like you loaded interval values, not ceiled ones
[08:14:16] also fdans - I have just re-checked the tests for AQS before deploying, and they fail :(
[08:14:22] * joal doesn't understand
[08:14:33] ayyy
[08:14:55] joal: is there any particular historical reason why we don't have CI in aqs?
[08:15:09] fdans: I don't know
[08:15:16] I don't think so
[08:16:52] joal: all right, will correct all
[08:18:58] fdans: I get it now - it's because you mix result-types in your test
[08:19:53] fdans: You should have 2 tests, one for numerical values, one for string ones, since we only check the type of the 1st value before either converting or returning as is
[08:24:14] joal: I'm more confused by the cassandra loading
[08:24:25] :(
[08:25:10] both the hdfs and local copies have the up-to-date query, I've no idea why it's storing intervals...
[08:26:21] any idea where oozie would be pulling that hql from?
[08:26:30] joal: ^^ sorry
[08:26:40] fdans: Have you updated oozie_directory in your oozie job launch?
[08:27:59] fdans: Can't find your oozie coord in hue - Do you have an ID?
[08:28:29] joal: 0065484-180510140726946-oozie-oozi-C
[08:28:30] ok found it
[08:28:46] fdans: Indeed, you have not updated the path for oozie to load the new code
[08:29:14] fdans: See in coordinator.properties - We use ${oozie_directory} to reference file paths
[08:29:37] joal: that's the hdfs oozie directory right?
[08:29:46] fdans: When you test, you want to override ${oozie_directory} to point to your HDFS copy of the oozie folder
[08:30:00] oook
[08:30:25] -Doozie_directory=hdfs:analytics-hadoop/user/fdans/oozie...
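joal's point is that coordinator.properties builds file paths from ${oozie_directory}, so a test run keeps loading the deployed HQL unless that property is overridden at submit time (the -Doozie_directory=... flag above). The Python sketch below models only that substitution step with string.Template. It is a simplification: real Oozie resolves its own EL expressions, and every property name other than oozie_directory, as well as every path, is hypothetical.

```python
from string import Template
from typing import Dict

# Hypothetical excerpt of a coordinator.properties file: several paths are
# built from ${oozie_directory}. Names and paths are illustrative only.
PROPERTIES: Dict[str, str] = {
    "oozie_directory": "hdfs://analytics-hadoop/path/to/deployed/oozie",
    "coordinator_file": "${oozie_directory}/example/coordinator.xml",
    "workflow_file": "${oozie_directory}/example/workflow.xml",
}


def resolve(properties: Dict[str, str], overrides: Dict[str, str]) -> Dict[str, str]:
    """Apply -Dkey=value style overrides first, then expand ${...} references.

    Simplified model: only one level of references is expanded, and unknown
    references are left untouched (safe_substitute).
    """
    merged = {**properties, **overrides}
    return {key: Template(value).safe_substitute(merged) for key, value in merged.items()}
```

Without the override, every derived path still points at the deployed copy, which matches what happened with the first test load: the coordinator picked up the old query even though the local and user-HDFS copies were up to date.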
[08:32:53] Analytics, Analytics-EventLogging, Performance-Team, Regression: Beta Cluster: Unable to obtain lock via objectcache (memcached add() fails) - https://phabricator.wikimedia.org/T198280#4321094 (aaron) I get: ``` > $s = new RemoteSchema('NavigationTiming', 18156125); > var_dump( $s->cache->lock...
[08:33:34] k, resending, will correct tests now
[08:44:18] Analytics, Analytics-Kanban, User-Elukey: Update piwik to latest stable - https://phabricator.wikimedia.org/T192298#4321105 (elukey) The last version of Matomo (the new Piwik name) contains a super interesting feature, namely use TLS for connections to the database. This is not important now that we...
[08:44:35] (PS1) Fdans: Correct per country tests [analytics/aqs] - https://gerrit.wikimedia.org/r/442796 (https://phabricator.wikimedia.org/T188928)
[08:45:20] (PS2) Fdans: Correct per country tests [analytics/aqs] - https://gerrit.wikimedia.org/r/442796 (https://phabricator.wikimedia.org/T188928)
[08:53:30] fdans: do we have the piwik tracking js code somewhere handy?
[08:53:42] I want to check if anything changed between piwik and matomo
[08:53:55] like for wikistats2 etc..
[08:54:25] elukey: looking
[08:54:35] joal: the data should be loaded now in cassandra :)
[08:54:44] fdans: checking fdans
[08:55:09] * joal checks fdans for fdans ... Autoreference for the win
[08:55:23] https://www.irccloud.com/pastebin/uaLoyBS6/
[08:55:26] elukey: ^
[08:56:16] fdans: Have you loaded the same previous month and all, or have you changed?
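The failing AQS tests come from the shortcut joal describes at 08:19:53: only the first value's type decides whether the whole result is converted to intervals or returned as is. Below is a minimal Python sketch of that behaviour (AQS itself is not written in Python). The 1000-wide bucket, the "low-high" interval format and the function names are all hypothetical. The sketch shows why a fixture mixing numbers and strings gives a result that depends on row order, and why separate numeric and string tests are the cleaner fix.

```python
from typing import Any, Dict, List

BUCKET = 1000  # hypothetical ceiling granularity, not taken from AQS


def to_interval(ceiled: int, bucket: int = BUCKET) -> str:
    """Hypothetical 'low-high' label for a value that was ceiled to `bucket`."""
    return f"{ceiled - bucket + 1}-{ceiled}"


def normalize(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert numeric `views` to interval strings.

    Mirrors the shortcut described in the log: only the first row's type is
    checked, and that decision is applied to the whole list.
    """
    if not rows or isinstance(rows[0]["views"], str):
        return rows  # assumed to already hold interval strings
    # If a later row already holds a string, to_interval raises TypeError here.
    return [{**row, "views": to_interval(row["views"])} for row in rows]
```

With a string first, later numeric rows pass through unconverted; with a number first, a later string row makes the conversion fail. Each homogeneous fixture behaves predictably on its own.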
[08:56:31] joal: same
[08:56:46] fdans: new data is there with views, however old data is here as well - Not good
[08:57:02] fdans: <3
[08:57:19] joal: wait
[08:57:31] I didn't load May this time
[08:57:42] joal: that's why you may see both
[08:57:43] Analytics, Analytics-Kanban, User-Elukey: Update piwik to latest stable - https://phabricator.wikimedia.org/T192298#4321127 (elukey) Just to be super sure, triple checked that the matomo's tracking JS code didn't change: ```