[02:14:09] (PS6) Milimetric: Allow filtering of data breakdowns [analytics/dashiki] - https://gerrit.wikimedia.org/r/278395 (https://phabricator.wikimedia.org/T131547) (owner: Jdlrobson) [02:19:17] (CR) Milimetric: [C: -1] "All the tests failed for me, must be something simple but central. The toggling works but I didn't review because the patterns don't stay" [analytics/dashiki] - https://gerrit.wikimedia.org/r/278395 (https://phabricator.wikimedia.org/T131547) (owner: Jdlrobson) [02:20:16] (CR) Milimetric: [C: 2 V: 2] Update scap deployment configuration for aqs [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/280921 (owner: Joal) [02:32:37] (CR) Milimetric: [C: 2 V: 2] "Looks good, nice cleanup. Especially the tests, I like how you repurposed some of the old ones because they're catching new edge cases." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/280201 (https://phabricator.wikimedia.org/T131049) (owner: Mforns) [02:34:59] (CR) Milimetric: [C: 2 V: 2] Reconfigure mobile-options-last-3-months query [analytics/limn-mobile-data] - https://gerrit.wikimedia.org/r/281818 (https://phabricator.wikimedia.org/T131849) (owner: Mforns) [02:41:37] (CR) Milimetric: [C: 2 V: 2] "I didn't like the float() use, I like Decimal instead because floats are evil. But in this case it really doesn't matter so I'll merge :)" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/280386 (https://phabricator.wikimedia.org/T130406) (owner: Mforns) [03:33:41] Analytics-Kanban, Release-Engineering-Team: [Spike] Figure out how to automate releases with jenkins {hawk} - https://phabricator.wikimedia.org/T130576#2182662 (madhuvishy) More progress! 6. Ran into some issues while adding the gerrit ssh key to /etc/ssh/ssh_known_hosts on the Jenkins slaves - they are d... [04:03:14] Analytics, iOS-app-Bugs: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2182669 (Tbayer) And it does not just affect pageviews tagged as iOS in `user_agent_map`, but also pageviews tagged as app views in `access_method` overall: {F3835623, width=80%} Data source: ```lang... [05:35:32] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2182708 (Lcanasdiaz) >>! In T103292#2167926, @Aklapper wrote: >>>! In T103... [05:49:51] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016): Mismatch between numbers for code merges per organization - https://phabricator.wikimedia.org/T129910#2182709 (Lcanasdiaz) This is clearly a bug with the KPIs we created for you guys. Sorry about this, it will be fixed asap. [06:41:29] Analytics, iOS-app-Bugs: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2182732 (Tbayer) @JMinor and I did a test today, and it appears that the app's requests aren't even showing up in the webrequest table. Data source: ```lang=sql SELECT * FROM wmf.webrequest WHERE year... [07:46:21] o/ [07:46:43] joal: helllooooooooo! Let me know when you are free to chat about our dear friend Cassandra :) [08:24:26] my current theory is that nodetool drain is too aggressive and doesn't remove the node from the ring [08:25:01] but theoretically it should be as a node "fails" due to other problems, like powerdown, etc.. 
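A quick reference for the drain theory being floated here (a sketch of stock nodetool behavior, not commands taken from the channel):

```
# nodetool drain only flushes memtables and stops the node accepting
# writes; it does NOT remove the node from the ring. Taking a node out
# of the ring is a separate, explicit operation:
nodetool drain                  # flush + stop accepting writes; node stays in the ring
nodetool decommission           # run on the leaving node: stream its data away and exit the ring
nodetool removenode <host-id>   # run from a live peer: force a dead node out of the ring
```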
[08:25:13] so the cluster should go on without it anyway [08:27:03] aaaahhhhhhh https://wiki.apache.org/cassandra/NodeTool [08:27:10] drain: Flushes memtables on the node and stop accepting writes. Reads will still be processed. Useful for rolling upgrades. [08:29:15] so the problem might have been systemctl stop cassandra [08:29:31] but shouldn't cassandra handle these types of failures gracefully? [08:32:41] http://www.planetcassandra.org/general-faq/#0.1_arch-9 still doesn't make sense [08:51:34] * elukey verifies the replication factor [09:08:34] Do not use the default replication factor of 1 for the system_auth keyspace. In a multi-node cluster, using the default of 1 precludes logging into any node when the node that stores the user data is down. [09:08:45] system_auth | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"} [09:11:23] can query SELECT * FROM system_auth.users; though on each node.. [09:12:43] but in the stack trace I can find [09:12:44] Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM at org.apache.cassandra.auth.Auth.selectUser(Auth.java:276) ~[apache-cassandra-2.1.12.jar:2.1.12] [09:26:08] https://wikitech.wikimedia.org/wiki/Cassandra#Authentication [09:30:20] Analytics-Tech-community-metrics, Developer-Relations (Apr-Jun-2016), developer-notice: Check whether it is true that we have lost 40% of (Git) code contributors in the past 12 months - https://phabricator.wikimedia.org/T103292#2182963 (Lcanasdiaz) >>! In T103292#2167926, @Aklapper wrote: >>>! In T103... [09:56:29] Analytics-EventLogging, MediaWiki-Vagrant: EventLogging vagrant role fails to provision - https://phabricator.wikimedia.org/T131085#2155589 (AdHuikeshoven) Today I got a related error after enabling flow: ``` ==> default: Error: Could not find command '/vagrant/srv/eventlogging/virtualenv/bin/pip' ``` [10:19:37] Analytics, Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183066 (elukey) So I checked the replication factor on the aqs nodes and this is the result: ``` cassandra@cqlsh> SELECT * FROM system.schema_keyspaces; keyspace_name | dura... [10:19:49] Analytics, Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183067 (elukey) p:Triage>Normal [10:26:37] (brb lunch!) [11:26:18] elukey: sorry for not having shown up earlier [11:26:28] hellooooo [11:26:31] elukey: I have to care for Lino today, so I'll be in-and-out :) [11:26:54] don't worry at all, nothing urgent, I was thinking out loud [11:27:06] I updated https://phabricator.wikimedia.org/T123629#2183066 with a summary [11:27:26] I've read your thoughts, and the system_auth replication is an interesting finding [11:27:47] * joal loves to read thoughts :) [11:28:59] elukey: summary is great :) [11:29:18] elukey: We should proceed with changing the replication factor and a repair [11:29:36] elukey: I back you up on this one ;) [11:29:54] joal: super! from what I know there are periodic repairs happening, but we should probably run it JUST IN CASE TM [11:30:22] * joal is an adept of elukey TM sentences :) [11:30:40] ahahahaah [11:31:01] so, I should run "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };" on each node and then run nodetool repair only on one [11:31:05] right?
[11:31:38] elukey: hm, I wonder: other keyspaces use NetworkTopologyStrategy [11:31:52] I don't know if it's better or not, in our case [11:32:52] elukey: https://docs.datastax.com/en/cassandra/1.2/cassandra/architecture/architectureDataDistributeReplication_c.html [11:33:08] from that, it seems simple strategy works for us since we only have one DC [11:33:30] So yes, you can run "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };" [11:33:41] And don't do it on each node, one should be enough [11:33:43] I think [11:33:46] yeah I've read the same, I believe that network topology was probably added with Dallas thoughts in mind? [11:33:56] ahhh okok, proceeding [11:34:27] elukey: Network topology is created by Restbase, and restbase can be cross-DC [11:35:24] elukey: let me know when you're done with the query [11:35:41] elukey: I'm querying aqs1003 for rep factors to double check :) [11:36:02] done :) [11:37:19] elukey: good on aqs1003 [11:37:35] same on aqs1002 [11:37:47] Niiice :) [11:37:57] also https://logstash.wikimedia.org/#/dashboard/elasticsearch/analytics-cassandra is not screaming :) [11:39:06] filippo also mentioned https://phabricator.wikimedia.org/T92355 to me [11:39:42] elukey: ok [11:40:00] mforns: this looks pretty interesting to try (new color interpolator with a d3 plugin): HCL: a color model that actually matches our perceptions (2011) [11:40:00] http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/ [11:40:09] mforns: this looks pretty interesting to try (new color interpolator with a d3 plugin): HCL: a color model that actually matches our perceptions (2011) [11:40:09] http://vis4.net/blog/posts/avoid-equidistant-hsv-colors/ [11:40:28] heh, sorry, morning brain [11:40:47] elukey: np milimetric :) [11:41:44] joal: I think we could skip the repair, what do you think? [11:41:50] or is it necessary? [11:42:10] elukey: I actually don't know if it's necessary [11:42:49] because I am not sure how heavy it could be on the cluster.. shouldn't be a problem, but I have no experience [11:43:17] elukey: from https://wiki.apache.org/cassandra/Operations, we should run repair [11:43:27] So let's do it :) [11:44:04] elukey: cause currently, because of the previous replFact 1, some nodes might not have the data [11:44:26] running repair forces each node to be consistent data-wise (not only schema-wise) [11:44:33] all right! [11:45:30] !log started nodetool repair on aqs1002 after running "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };" [11:45:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [11:50:44] taking ages to complete... mmmm [11:51:39] elukey: this is expected [11:51:56] elukey: it double checks every keyspace for correct data [11:52:12] elukey: so it takes ages and requires a good bit of CPU+IO [11:53:20] joal: ah yes it is now repairing 768 ranges for keyspace local_group_default_T_pageviews_per_project [11:54:00] ;) [11:54:23] elukey, funny how aqs1002 has had a bump of nice on CPU for the past 40 mins ;) [11:56:33] where did you check it? [11:56:46] ganglia :) [11:57:03] ahhh yes :) But overall it is looking good [11:57:15] elukey: definitely [11:58:33] Analytics, Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183195 (elukey) Executed the command and started nodetool repair on aqs1002. [12:02:20] elukey: Lino is waking up, will be back in a while [12:02:37] sure!!
thanks joal! [12:02:40] I'll keep watching [12:11:54] Analytics-Kanban, Release-Engineering-Team: [Spike] Figure out how to automate releases with jenkins {hawk} - https://phabricator.wikimedia.org/T130576#2183215 (hashar) I have removed the `Ldaptestaccount123 ` user from the Gerrit `Analytics-devs` group since the password has been made public here. https:/... [12:33:53] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Analytics%20Query%20Service%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1459945981&g=network_report&z=large [12:33:57] woa [12:37:39] stat1002's disk has been replaced [13:16:21] elukey: what's the woa for? [13:18:21] milimetric: network traffic :) [13:19:27] that it's spikey? [13:19:43] that's probably just the loading, no? [13:20:18] looks like it happens right after midnight in the weekly graph [13:20:45] no no, it was me running nodetool repair :) [13:21:03] it is still running (compacting sstables for local_group_default_T_pageviews_per_article_flat from what I can see) [13:21:13] but the first part was triggering a lot of traffic :) [13:36:08] anybody working on analytics1051 by any chance? [13:36:41] shouldn't be a new node and it has been rebooted apparently [13:39:12] doesn't appear to be a new node.. [13:49:40] mmmmmmm nodetool repair seems stuck for some reason [13:49:57] I am not seeing load or logs related to it [13:50:02] only regular compaction [13:50:21] and this "Lost notification. You should check server log for repair status of keyspace local_group_default_T_pageviews_per_article_flat" [13:50:35] joal, milimetric: did you get this error in the past? [13:51:58] elukey: Can't recall [13:52:25] nope [13:54:01] elukey: the good person to ask is urandom (our cassandra expert) [13:54:47] ah yes, in security they mentioned him :) [13:55:03] so the repair process is not running anymore, but I guess that we could launch a new one [13:56:07] good mororrningg [13:56:29] hello ottomata [13:56:48] elukey: I think you can launch a new one on another node [13:57:00] hello ottomata! [13:59:25] joal: I am running "nodetool repair system_auth" on aqs1001 [13:59:33] cool elukey [13:59:46] !log ran nodetool repair system_auth on aqs1001.eqiad [13:59:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:04:04] !log ran nodetool repair system_auth on aqs1002.eqiad/aqs1003.eqiad [14:04:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log, Master [14:13:53] elukey: you probably want to use -pr on those (primary range) [14:14:13] elukey: for system_auth, it probably doesn't matter a lot, but for the others it will [14:14:56] urandom: what does -pr do? [14:15:27] elukey: without -pr, it will repair every range on the node, the primary, and all replicas, comparing them against the other nodes. with -pr it only does the primary range, and syncs that to the replicas, so tl;dr you're only going over the data once [14:15:51] urandom: makes sense ! [14:16:10] i.e.
it avoids triple-handling of the data [14:16:59] if the node was down for a bit, and that's why you were repairing, a regular repair makes sense, you want to fix everything for that specific node [14:17:20] but if you're running routine repairs across the entire cluster, -pr is your friend [14:17:39] thanks for the explanation :) [14:17:43] urandom: in that specific case, it's to ensure data consistency after the replication factor change [14:18:11] joal: ok, so that is something you need to do on all nodes, and ideally you'd use -pr [14:18:23] makes sense [14:18:31] system_auth is small tho [14:18:59] urandom: now that I've run nodetool repair on one node and got stuck at "Lost notification. You should check server log for repair status of keyspace local_group_default_T_pageviews_per_article_flat" what should I do? [14:19:31] running nodetool repair -pr again on it and on the other nodes? [14:19:53] (very ignorant about cassandra so sorry for the dumb questions) [14:19:54] no, the repair is still running, you'll just need to watch the logs if you need to know when it's done [14:20:00] elukey: no, no worries [14:20:47] the repair runs in the background, but as a convenience, it tries to do a pub-sub thing using a jmx notification, so that nodetool will block and send you status updates [14:20:57] but jmx notifications are janky and sometimes it disconnects [14:21:16] all right, makes sense [14:21:30] so it should keep trucking along, you just can't see it in the console anymore (hence the need to watch the logs) [14:22:18] yeah I was puzzled because I was seeing only compaction of sstables and nothing more for more than an hour, so I thought it was somewhat stuck/not_working anymore [14:22:59] urandom: about the loading issue, I'll apply the strategy we discussed briefly yesterday: go deeper into the loading code and try to find the root cause [14:23:22] joal: seems like the problem is localized to aqs1002 [14:23:52] urandom: I agree [14:24:03] urandom: however other jobs don't fail ! [14:24:04] joal: i don't suppose there is anything about that process that is specific to that node? it's not multiple processes (threads, etc) distributed over the cassandra nodes, is it? [14:24:17] joal: it's only one type of job that fails? [14:24:23] urandom: yes ! [14:24:35] all the others (even the huge one) succeeded [14:25:03] interesting [14:25:28] That's why I was wondering about a repair, because it seems related to that specific keyspace [14:26:24] The other completely weird thing is that for some jobs that have failed, we sometimes manage to have them working, who knows why [14:26:29] joal: these are just writes, no? [14:26:35] That's completely unpredictable [14:26:42] urandom: yessir [14:26:48] urandom: UPDATE to be precise [14:27:06] yeah, i don't think a repair is going to do anything here [14:27:13] right [14:28:15] from the logs, like yesterday, it starts with some org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level QUORUM errors [14:28:26] which might be kind of normal for a high-load scenario [14:28:35] for some value of "normal" [14:28:37] urandom: very small data tho [14:30:20] urandom: I have some questions about cassandra and a phab task, can I ask them in here or do you prefer an email? (I promise that I'll write documentation!) [14:30:33] elukey: shoot [14:30:48] thanks!
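For the record, the system_auth fix worked out above boils down to roughly this (a sketch assuming a single-DC, three-node cluster like aqs1001-3; the ALTER runs once, the repair on every node):

```
# Raise the replication factor so auth lookups can reach QUORUM even
# with one node down (with RF=3, QUORUM needs 2 live replicas).
cqlsh -e "ALTER KEYSPACE system_auth WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': 3 };"

# Then, on each node in turn, repair only that node's primary ranges
# (-pr), so the auth data is streamed to the new replicas without
# going over every range once per replica.
nodetool repair -pr system_auth
```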
[14:31:51] 1) we need to upgrade nodejs from 0.10 to 4.3 on AQS, and we were unsure if cassandra needs to be stopped as part of the node upgrade. [14:32:30] nope [14:33:14] you shouldn't need to touch Cassandra for that [14:33:26] 2) for https://phabricator.wikimedia.org/T123629 (that is the node upgrade) i tried to stop cassandra on aqs1001 (after nodejs stop) doing nodetool drain && systemctl stop cassandra. It didn't go well, tons of Quorum errors like the one you were describing above [14:34:04] the only possible explanation that I found was the system_auth replication [14:34:38] but in general, I'd like to know if the drain && stop is correct or not (also to update https://wikitech.wikimedia.org/wiki/Service_restarts) [14:35:06] I've read about disablegossip && cassandra stop, not sure which is best [14:35:14] (let's say if you have to reboot a host) [14:35:28] ah ha! [14:35:48] i guess this is why you guys are changing the replication factor of system_auth? [14:35:52] yesss [14:36:10] because you've already realized the quorum failures are from auth [14:36:40] I think it was auth, not really sure, but it is the only thing that made sense to me [14:37:33] system_auth | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"} <-- yeah, that's a problem [14:37:44] \o/ [14:37:55] elukey: told you, you ROCK :) [14:38:09] why that hasn't caused you more problems until now, i cannot fathom [14:39:20] joal: so any chance the failures caused by the unavail exceptions are stacking up on the loader-end and not being handled? [14:39:41] urandom: completely possible [14:40:04] urandom: currently setting up my testing env for deeper analysis [14:40:19] after those errors (on the Cassandra-side), you see very low level errors from netty (the framework used for the CQL transport) that seem to indicate the remote end just shut down the connection [14:40:55] urandom: That's what the logs say at least [14:42:14] urandom: currently trying to load some data in keyspace local_group_default_T_unique_devices_test_joal [14:42:59] k [14:43:53] so there might be some underlying problem with one node that manifests as these errors making quorum, or it might be something transient caused by high load [14:44:04] either way you should be able to survive it [14:44:12] a hiccup with one node, i mean [14:44:20] Looks like my loading is blocked as usual: trying to insert 614 rows, and currently only 394 are committed, while my job tells me it should be done [14:44:26] and it looks like you guys are on the right track to make that so [14:44:55] joal: just now? [14:45:01] urandom: it usually works fine: our heavy loading job almost never fails, meaning some minor errors are usually handled correctly [14:45:05] currently yes [14:45:23] joal: yeah, just now: https://phabricator.wikimedia.org/P2864 [14:45:59] RAAAAH ! makes no sense [14:46:07] urandom: possibly data-related? [14:46:58] joal: ¯\_(ツ)_/¯ [14:47:02] :) [14:47:05] :) [14:47:32] joal: i see no recent exception for making quorum, though [14:48:01] urandom: seems related to something else, but I can't put my finger on it ... Will keep pinpointing [14:51:15] joal: mmmm the error seems to only happen on aqs1002 [14:51:25] elukey: yeah, we noticed that [14:51:37] elukey: That doesn't make sense tho !
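Some checks that can help narrow down a "QUORUM fails only via aqs1002" mystery like this one (a sketch; the keyspace, table, and key names below are illustrative, not the real ones):

```
# With RF=3, QUORUM = floor(3/2) + 1 = 2, so a write should survive any
# single node being down. If it doesn't, inspect the suspect node's view:
nodetool status      # does this node see any peer as DN (down)?

# Which replicas actually own the partition that keeps failing?
nodetool getendpoints local_group_default_T_unique_devices_test_joal data some_partition_key

# Are mutations piling up on one host? Compare pending counts across nodes.
nodetool tpstats
```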
[14:54:50] there must be something weird on the host, will try to check [14:55:00] elukey: Thanks a mil mate [14:55:18] elukey: seems related to the newly created table, since old jobs don't fail [14:55:59] very newbie question about cassandra: when you load data you can hit whatever node you wish right? [14:56:10] elukey: yes [14:56:48] and even if aqs1002 fails for some reason the quorum should be reached no? [14:57:10] yeah [14:57:11] assuming that 1001/1003 don't fail [14:57:17] right [14:58:08] that's why I mentioned the error joal pasted from yesterday, from the loading process, it looked like the failure was sourced from aqs1002, but it also looked like the driver was attempting to use no other nodes [14:58:13] which seemed weird [14:58:23] maybe i was reading the exception wrong [14:58:25] urandom: weird indeed [14:58:44] even if there were only one contact point, the driver should discover the other nodes [14:58:49] I'm going into a meeting, will be back on that after [14:59:13] me too, thanks a lot for all the info! [14:59:18] urandom: it actually does when setting up connections, but doesn't seem to recover correctly from errors [14:59:23] no worries, let me know how I can help! [14:59:27] Analytics-Kanban: Align labels in legend - https://phabricator.wikimedia.org/T131935#2183540 (Milimetric) [14:59:34] urandom: we'll get back to you probably at some point ;) [14:59:38] Thanks a lot again :) [14:59:40] kk [14:59:42] no worries [14:59:53] (PS1) Milimetric: Sort the legend by value and align it [analytics/dashiki] - https://gerrit.wikimedia.org/r/281947 (https://phabricator.wikimedia.org/T131935) [15:14:40] Analytics-EventLogging, MediaWiki-Vagrant: EventLogging vagrant role fails to provision - https://phabricator.wikimedia.org/T131085#2183604 (Ottomata) Hmmm, I can't seem to reproduce this. I ripped out my srv/eventlogging and provisioned, had no problem cloning or creating the virtualenv or installing d... [15:15:32] Analytics-Cluster, Operations, hardware-requests: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183608 (Ottomata) @robh, bump on this too. [15:17:31] Analytics: Evaluate whether to rewrite varnishkafka in python - https://phabricator.wikimedia.org/T131938#2183609 (Nuria) [15:27:00] Analytics-Cluster, Operations, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2183653 (RobH) a:RobH>None Yes, I think we need a network admin to investigate the dhcp ability of the analytics vlan to carbon, as I cannot seem to... [15:44:22] Analytics-Kanban, Release-Engineering-Team: [Spike] Figure out how to automate releases with jenkins {hawk} - https://phabricator.wikimedia.org/T130576#2183715 (madhuvishy) Spoke about this on irc already but leaving it here - Only the username of the test user was public, not the password. The commit auth... [15:55:54] joal: do you have by any chance other logs like https://phabricator.wikimedia.org/P2864 saved?
[15:56:20] was trying to figure out if 10.64.36.132:42328 :> /10.64.32.175:9042 changes or not [15:56:49] (not sure how you are loading stuff to cassandra though) [16:02:36] elukey: will try to find some for you :) [16:06:01] Analytics-Kanban: Put quaterly review together for Q3 - https://phabricator.wikimedia.org/T131947#2183874 (Nuria) [16:06:19] Analytics-Kanban: Put quaterly review together for Q3 - https://phabricator.wikimedia.org/T131947#2183874 (Nuria) a:Nuria [16:07:28] Analytics-Kanban, Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2183891 (elukey) [16:26:43] joal, elukey: you guys know about https://logstash.wikimedia.org/#/dashboard/elasticsearch/analytics-cassandra ? [16:27:18] urandom: I didn't before today, but found a link earlier today :) [16:27:19] Thanks [16:28:00] ottomata: want your help adding permissions for the user on gerrit [16:28:02] urandom: yess thanks! [16:28:11] when you're back from lunch? [16:28:28] madhuvishy: let's do it now [16:28:32] ottomata: oh [16:28:33] okay [16:28:51] urandom: I was wondering if we have per-host metrics related to restbase (not aggregated). It would be great to spot differences between hosts [16:29:23] madhuvishy: sooo [16:29:24] what? [16:29:28] :_ [16:29:30] :) [16:29:31] elukey: we have per-host cassandra metrics [16:29:33] i had the user (chase.mp@gmail.com) added to https://gerrit.wikimedia.org/r/#/admin/groups/833,members, but it was removed today because of some misunderstanding - i need to add the user back, and also need the user to be able to push tags [16:29:36] ottomata: ^ [16:29:37] https://logstash.wikimedia.org/#dashboard/temp/AVPsaCuZO3D718AOl1JP btw (for connection resets) [16:29:49] elukey: restbase, probably not [16:30:02] elukey: cassandra, yes [16:30:23] ok madhuvishy that's it? [16:30:29] ottomata: yes [16:30:48] elukey: assuming i understand the question correctly [16:30:48] done [16:31:03] gwicke: thanks :) [16:31:10] elukey: we also have per-host sampled 'slow request' logs [16:31:12] ottomata: what is the special permission to push tags? is it different? [16:31:17] but I think those aren't enabled for aqs [16:31:19] urandom: yep yep [16:31:49] gwicke: sounds interesting, is there documentation about how to enable them? [16:31:51] madhuvishy: not sure what else is needed other than push (force?) [16:31:57] but analytics-devs have it on refinery/source [16:32:08] if we can do it, then this user should be able to [16:32:13] ottomata: hmmm that's what I had set up as of yesterday [16:32:15] elukey: https://github.com/wikimedia/operations-puppet/blob/production/modules/restbase/templates/config.yaml.erb#L894-L898 [16:32:25] ottomata: but it was able to push commits but not tags [16:32:25] madhuvishy: do you know what the error was? [16:32:28] yeah [16:32:29] hm [16:32:40] elukey: your sample rates should be higher, to compensate for lower request rates [16:32:48] there is a special perm for push annotated tags [16:32:54] ottomata: https://phabricator.wikimedia.org/T130576#2182662 Under point 9 [16:32:57] and push signed tags [16:33:25] but for example, I can push tags - and I don't think I have those permissions [16:34:04] hmmm might be the annotated tag one [16:34:09] madhuvishy: if you are testing this [16:34:09] gwicke: thanks! [16:34:13] try making an annotated tag [16:34:16] with git tag -a [16:34:19] ottomata: well [16:34:22] jenkins is [16:34:27] oh you know it is? [16:34:33] I'll take a look at it..
need to get some experience with restbase/cassandra :) [16:34:40] no i mean - i don't think it's annotated [16:34:50] hmmm [16:34:58] elukey: generally, the slow part is cassandra [16:34:59] ottomata: git tag -F /tmp/maven-scm-1052121632.commit v0.0.29 [16:35:03] is what it's doing [16:35:15] and then git push ssh://testaccount123@gerrit.wikimedia.org:29418/analytics/refinery/source refs/tags/v0.0.29 [16:35:26] ahhh madhuvishy i think that is annotated [16:35:27] :D [16:35:28] if there is a message [16:35:41] ottomata: ha [16:36:02] hmm, but, is that what maven is doing, right? [16:36:07] joal: I think that your idea of debugging the loading part is right, the errors all point to that.. but still it looks very weird [16:36:15] is it the same thing that maven would do from our local machines? [16:36:34] ottomata: that is what I'd assume - so far it's been the same [16:36:38] hm ok [16:36:43] well, i can add this permission [16:36:49] and then you can try it? [16:37:13] done [16:37:16] https://www.irccloud.com/pastebin/hWQvkFE6/ [16:37:19] ottomata: cool [16:37:21] i'll try it [16:37:30] anyhooww, going offline folks! Talk with you tomorrow! [16:37:37] ahh laters elukey [16:37:39] looks like annotated [16:37:42] sorry we didn't look at an51 [16:37:42] joal: let me know if you'll find anything! [16:37:50] elukey: for sure ! [16:37:59] elukey: testing again (tonight job) [16:38:10] ottomata: nah not a big issue, if you have time would you mind checking if there is a clear reason for a reboot? [16:41:52] elukey: looking, uhhh, something weird [16:42:01] the node rebooted at midnight exactly [16:42:22] milimetric, I'm looking at your change, LGTM, and was thinking about the % sign change... If we change 22(%) to 0.22, we don't need % any more right? We could just make it clear in the graph title maybe? [16:43:19] ottomata: It worked I think! [16:43:35] ottomata: :O [16:43:50] I checked /var/log/mcelog because I thought about a thermal issue or whatever [16:43:53] but nothing [16:44:38] ottomata: it pushed a tag :D [16:44:58] madhuvishy: yeehaw [16:45:14] failed at pushing to archiva - which is what i expected [16:45:15] 16:42:55 [INFO] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:2.7:deploy (default-deploy) on project refinery: Failed to deploy artifacts: Could not transfer artifact org.wikimedia.analytics.refinery:refinery:pom:0.0.29 from/to archiva.releases (https://archiva.wikimedia.org/repository/releases/): Failed to transfer file: [16:45:15] https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/refinery/0.0.29/refinery-0.0.29.pom. Return code is: 401, ReasonPhrase:Unauthorized. -> [Help 1] [16:45:43] now to figure out how to give it those powers [16:46:57] Analytics-Kanban, Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#1934216 (Eevans) >>! In T123629#2143751, @MoritzMuehlenhoff wrote: > Upgrade procedure: > - Depool one of the aqs servers via conftool > - Stop restbase > - nodetool drain && systemctl stop cassandra > - u... [16:47:06] ottomata: Apr 6 13:30:24 analytics1051 cron[1806]: (CRON) INFO (Running @reboot jobs) - what the... [16:49:00] whaaa [16:49:17] elukey: also.....i don't see datanode startup logs in hadoop-hdfs/logs [16:49:25] around midnight [16:49:31] the logs just keep trucking from before midnight and beyond it [16:49:46] how did you learn that this node had rebooted? [16:49:47] that log is in syslog but now that I can see probably after the reboot..
super weeeird [16:50:15] operations channel, I have highlights and it started to throw tons of errors for each daemon [16:52:18] gotta go, have a good day folks! [16:52:21] byyyyyeeeeeeeeeeeeee [16:52:24] okbyeeweeee [16:52:28] lunchtime [17:06:24] Analytics, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2184046 (JMinor) p:Triage>High [17:06:45] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2179692 (JMinor) [17:10:05] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2184102 (JMinor) Thanks @Tbayer . I watched the request traffic on Charles proxy opted out on the app, and t... [17:14:24] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2179692 (JAllemandou) @Tbayer, @JMinor: For a webrequest to be counted as a Pageview from an app on iOS, it... [17:34:50] (CR) Mforns: [C: 2 V: 2] "LGTM! Awesome :]" [analytics/dashiki] - https://gerrit.wikimedia.org/r/281947 (https://phabricator.wikimedia.org/T131935) (owner: Milimetric) [17:52:20] Analytics-Cluster, Operations, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184248 (faidon) The port was also on the labs-instance-ports interface-range, which set the port-mode to trunk (and also added labs-instances1-eqiad to t... [18:00:06] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2184281 (JMinor) @Mhurd see @JAllemandou comment above for things to look for in our requests. [18:14:45] (CR) Nuria: "Ahem... let me make sure I pushed all changes." [analytics/dashiki] - https://gerrit.wikimedia.org/r/278395 (https://phabricator.wikimedia.org/T131547) (owner: Jdlrobson) [18:16:39] Analytics-Cluster, Operations, hardware-requests, netops: setup/deploy server analytics1003/WMF4541 - https://phabricator.wikimedia.org/T130840#2184309 (RobH) Ok, multiple attempts have still resulted in no joy (no dhcp request hitting carbon.) The system was also showing in the config in the def... [18:20:22] Analytics: Wikistats 2.0. Edit Reports: Setting up a pipeline to source Historical Edit Data into hdfs {lama} - https://phabricator.wikimedia.org/T130256#2184311 (Nuria) Code name for the bigger task of data gathering: "data lake" : https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake [18:21:40] madhuvishy: every step of the jenkins to archiva path is a FIGHT! [18:22:12] nuria_: well it's never been done before - i didn't think it would be otherwise [18:31:03] milimetric: two tests fail for me, i will fix them but maybe a bower update is what is making all your tests fail? [18:42:04] nuria_: I tried bower updating but didn't get anything new [18:42:24] I noticed another test was failing non-deterministically [18:42:33] I hate that, and wish I could get rid of it [18:42:39] milimetric: mmm.. still only two tests fail for me though, which they should, need to fix my code a bit [18:42:52] mmm... in my case that does not happen [18:43:31] k, cool.
I'll review once you fix those and try to find that annoying nondeterministic one [18:43:35] had to do with buildHierarchy [18:44:41] Analytics-Kanban: End date not included - https://phabricator.wikimedia.org/T131641#2173602 (Milimetric) a:Milimetric [18:46:32] milimetric: ah, it is because i do not have the latest code then, let me fix/push and rebase [18:47:09] Analytics-Kanban: Allow configurable number formatting - https://phabricator.wikimedia.org/T131965#2184413 (Milimetric) [19:00:53] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2179692 (Mhurd) a:Mhurd [19:02:00] Analytics-EventLogging, MediaWiki-Vagrant: EventLogging vagrant role fails to provision - https://phabricator.wikimedia.org/T131085#2184469 (AdHuikeshoven) @Ottomata Here is what I get on a pristine VM. First before pulling your change, and then after pulling your change. Indeed, it didn't affect this pr... [19:14:35] madhuvishy: is there any simple declarative way to have the hashed client ips in event logging? [19:14:44] that is, on a per-schema basis [19:14:50] dr0ptp4kt: we dropped it [19:15:19] completely, so no, I don't think so [19:15:24] madhuvishy: right, was just reading through that email. it didn't seem from the patches or discussion there was a means of reinstating it on a case by case basis [19:15:35] madhuvishy: ok, understood. thx [19:15:47] nope. there was only one need for it - and we waited until quicksurveys was over [19:16:11] the field was only meant to be there for debugging purposes anyway [19:21:42] (PS1) Milimetric: Fix date range for per-article [analytics/aqs] - https://gerrit.wikimedia.org/r/281982 (https://phabricator.wikimedia.org/T131641) [19:21:49] Analytics-Kanban, Patch-For-Review: End date not included - https://phabricator.wikimedia.org/T131641#2184535 (Milimetric) Thanks for the bug report, @Nettrom. The patch I'm submitting should fix this, the documentation is the intended behavior. We will merge and deploy this hopefully this week. The bu... [19:39:29] hi a-team, got a question from Wes: Is there a dashboard for uniques? [19:40:07] are there any plans to put it in vital signs? [19:40:45] kevinator: not yet [19:40:49] there are tasks though [19:42:03] kevinator: https://phabricator.wikimedia.org/T122533 [19:42:35] thanks madhuvishy [19:43:13] BTW Wes was mentioning he saw some visualizations on uniques. Do you know if anything new was done in the last couple of weeks? [19:43:30] kevinator: where did he see it? [19:43:44] nope, just nuria did some one-offs for the blog post [19:43:49] I don't know. [19:43:58] we haven't made anything dashboard-y [19:44:06] yeah - only one-offs that are there on the post [19:44:07] Yeah, I think it was mostly one-offs that have been circulating [19:44:20] kevinator: I'd be interested in hearing Wes's thoughts on vital signs, I think he had some problems with it last we spoke [19:44:27] and it sounds like he's using it so we should make it better? [19:44:43] I don't know that he is using Vital-Signs... [19:45:00] but he probably would when he looks at metrics [19:45:35] BTW any idea when uniques would be in Dashiki?
[19:46:07] nuria_: ^ [19:46:12] yessir [19:46:17] sorry, madam [19:47:41] nuria_: just pointing you to kevin's question about visualizations :) [19:48:00] kevinator: we do not have an ETA, it depends on the approach we want to take, whether we will visualize data from internal sources [19:48:08] the work would be very quick but not sure it has a chance to get done before May [19:48:14] kevinator: or the API we hope to be able to load our data into [19:48:46] I <3 AQS :-) [19:49:05] Kevinator: once we wrap up the browser work this week and announce it we might have a better idea: https://browser-reports.wmflabs.org/#all-sites-by-os [19:49:44] kevinator: so to sum up: our current viz work is browsers [19:50:10] ok thanks... If Wes has more questions, nuria_ I'll send them your way. [19:50:23] kevinator: we are soon going to call our first version done, once we do that we will see what is next, we can let you know when we know [19:50:42] sure [20:03:41] Analytics-EventLogging, MediaWiki-Vagrant: EventLogging vagrant role fails to provision - https://phabricator.wikimedia.org/T131085#2184657 (Ottomata) Hm, what's strange is I don't see the `Virtualenv::Environment[/vagrant/srv/eventlogging/virtualenv]` created in your log. This is what installs pip into... [20:48:24] * joal has won against hadoop-cassandra [20:51:43] Analytics-Kanban: Productionitize druid - https://phabricator.wikimedia.org/T131974#2184741 (Nuria) [20:51:52] Analytics: Productionitize druid - https://phabricator.wikimedia.org/T131974#2184753 (Nuria) [20:54:48] yay joal :D [20:54:54] i'm excite [20:55:17] madhuvishy: Thanks for sharing excitement :D [21:01:06] done for tonight ! see you tomorrow a-team ! [21:02:27] good night joal [21:03:49] laters! [21:03:56] hmm, need brainbounce, madhuvishy, yt? [21:03:58] ottomata: hashar Look at this I released some jarrrssss [21:04:03] HEYYYYYYY [21:04:04] https://integration.wikimedia.org/ci/job/analytics-release-test/38/ [21:04:20] NNICE! [21:04:21] https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/refinery/0.0.29/ [21:04:26] NICE! [21:04:29] that's awesome! [21:04:30] joal: ^^ [21:04:53] there's a rogue version in archiva now though, sorry about that :D [21:05:10] GREAT madhuvishy !!! [21:05:20] Automatic deploys !!! YAYA ! [21:05:23] ottomata: yes we can brain bounce [21:05:38] Thanks madhuvishy for making our life easier :) [21:05:40] joal: automatic releases only so far :P [21:05:52] riiight, good enough to party ;) [21:05:59] and this is all experimental for now - need to actually write config code etc [21:06:05] madhuvishy: batcave? [21:06:16] ottomata: yup [21:06:52] (CR) Joal: [C: 1] "LGTM, merge when you want." [analytics/aqs] - https://gerrit.wikimedia.org/r/280934 (https://phabricator.wikimedia.org/T131369) (owner: Milimetric) [21:07:42] madhuvishy: ? [21:07:47] ottomata: sorry [21:07:53] connection died, joining back [21:15:20] madhuvishy: super nice !!!!!!!! [21:15:51] maybe we could reuse that logic for a wild bunch of maven-based projects we have ;-} [21:16:26] discovery and mobile have a few [21:21:53] hashar: yes!
once it's all translated to yaml [21:21:58] thanks ;) [21:22:35] HMM, madhuvishy our conversation just made me think about hafnium and other places eventlogging is deployed [21:22:56] i did not do due diligence on those boxes when i ended the python setup.py install for deployment [21:22:57] hmmm [21:23:01] aah [21:23:04] they still have eventlogging globally installed [21:23:37] right [21:24:14] HMMM [21:24:15] but [21:24:16] hmmm [21:24:20] does it even use eventlogging? [21:24:25] i see stuff there just using ZMQ [21:24:39] ah i found one [21:24:42] ottomata: uhh i remember at least one [21:24:51] that does import eventlogging [21:25:02] yeah, ve [21:25:11] that stuff should all change to kafka anyway [21:25:25] they are just consuming one topic [21:26:04] well, ha, not really, they still use ZMQ [21:26:08] which is eventlogging-valid-mixed [21:26:18] but ja from the ZMQ forwarder [21:26:19] ottomata: oh ya but they only need one topic right? [21:26:32] for meta in events.filter(schema='Edit'): [21:26:33] yup [21:26:40] they should consume from that Kafka topic :/ [21:26:44] yup [21:26:48] ok, i'm going to make a task to fix this [21:26:55] it already exists [21:26:56] they need to run code out of eventlogging dir, not global [21:26:58] oh? [21:27:01] yes yes [21:27:05] you made it :) [21:28:34] ottomata: https://phabricator.wikimedia.org/T110903 [21:29:18] Analytics-EventLogging, Analytics-Kanban, Scap3 (Scap3-Adoption-Phase1): Stop using global eventlogging install on hafnium (and any other eventlogging lib user) - https://phabricator.wikimedia.org/T131977#2184882 (Ottomata) [21:29:23] ah and that one [21:29:24] ja [21:29:36] Analytics-EventLogging, Analytics-Kanban, Scap3 (Scap3-Adoption-Phase1): Stop using global eventlogging install on hafnium (and any other eventlogging lib user) - https://phabricator.wikimedia.org/T131977#2184899 (Ottomata) Also related: https://phabricator.wikimedia.org/T110903 [21:29:57] ottomata: is there a need to use eventlogging at all - can they just directly consume from kafka? [21:30:45] sure they can [21:30:48] that would be totally fine [21:30:55] they aren't using EL in some of their consumers anyway [21:31:42] ya [21:31:46] that would be coool [21:32:18] then no need to deploy EL to hafnium at all [21:35:26] ja [21:35:41] ok madhuvishy i can test this git::clone and make it work on stat1002 without affecting anything else [21:35:43] thanks for the help [21:35:48] think i'm heading out for the day pretty soon [21:35:51] sooOooO TTYT! [21:35:56] ottomata: np! cya [22:16:25] Analytics-Kanban, Release-Engineering-Team: [Spike] Figure out how to automate releases with jenkins {hawk} - https://phabricator.wikimedia.org/T130576#2185058 (madhuvishy) More things! 11. It failed saying it couldn't push a tag because the test user needed Push Annotated Tag permissions in Gerrit. Added... [22:37:18] Analytics-Kanban, Release-Engineering-Team: [Spike] Figure out how to automate releases with jenkins {hawk} - https://phabricator.wikimedia.org/T130576#2139858 (greg) well done, @madhuvishy! (and @dduvall!)
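On the hafnium thread above (reading the Edit events straight from Kafka rather than via the ZMQ forwarder), a minimal consumer sketch could look like this; the broker address is an assumption, and the topic name just follows the usual eventlogging_<SchemaName> convention:

```
# Consume schema-validated EventLogging events directly from Kafka;
# no local eventlogging install needed, just kafkacat and jq.
kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t eventlogging_Edit -o end |
while read -r line; do
    # each line is one JSON event: the EventCapsule plus the schema payload under .event
    echo "$line" | jq -c .event
done
```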
[23:14:20] Analytics, Developer-Relations, MediaWiki-API, Reading-Admin, and 4 others: Metrics about the use of the Wikimedia web APIs - https://phabricator.wikimedia.org/T102079#2185375 (bd808) [23:14:22] Analytics, MediaWiki-API, Reading-Infrastructure-Team, MW-1.27-release-notes, and 3 others: Publish detailed Action API request information to Hadoop - https://phabricator.wikimedia.org/T108618#2185371 (bd808) Open>Resolved Schema documented on wiki at https://wikitech.wikimedia.org/wiki/Anal... [23:30:36] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2185551 (Mhurd) @TBayer heya I just confirmed the code responsible for the user agent regression :) I'll try... [23:36:04] Analytics, Wikipedia-iOS-App-Product-Backlog, iOS-app-feature-Analytics, iOS-app-Bugs, iOS-app-v5.0.3-Disco: Invalid pageview data for iOS app - https://phabricator.wikimedia.org/T131824#2185584 (Mhurd) To clarify, until we release the 5.0.3 update with the fixed `WikipediaApp` user agent, you'... [23:48:58] milimetric: Are there WMF projects that the pageviews API doesn't work for? It seems to fail for www.mediawiki.org [23:49:15] https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/www.mediawiki/all-access/all-agents/Help%3AVisualEditor%2FUser_guide/daily/2016030100/2016033100
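A quick way to narrow that down is to compare against a known-good project and try both domain spellings (a sketch; the dates are arbitrary, and which form the API stores mediawiki.org under is exactly the open question):

```
# Baseline that should return data:
curl -s 'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Main_Page/daily/2016030100/2016033100'

# Same query shape against mediawiki.org, both ways:
for project in www.mediawiki www.mediawiki.org; do
  curl -s "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/${project}/all-access/all-agents/Help%3AVisualEditor%2FUser_guide/daily/2016030100/2016033100"
  echo
done
```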