[06:24:40] Analytics-General-or-Unknown, Language-Engineering, Mobile-Apps, Wikipedia-Android-App-Backlog, and 2 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#2303921 (Arrbee)
[07:15:00] mobrovac: Buongiorno! Whenever you have 10 minutes I'd need your help with AQS deployment, or just a gentle RTFM for docs that I can't find :D
[07:16:51] elukey: what's up?
[07:17:07] saw a lot of alerts passing by yesterday for it
[07:18:04] most of which were ok, but the aqs root url was worrisome
[07:18:38] yeah new hosts, so no real prod traffic
[07:19:13] I ran puppet on them and then tons of alerts started to scream
[07:20:00] first question is, I am not able to deploy from tin to aqs100[456] (precisely https://gerrit.wikimedia.org/r/#/c/289224/)
[07:20:02] yes, that's the usual behaviour
[07:20:31] the error seems to be the same that I had a while ago in beta, namely my username not able to access the ssh keys holder
[07:20:51] k, gimme a sec
[07:21:43] hmmm, Permission denied (publickey,keyboard-interactive)
[07:21:47] oh ok
[07:22:25] elukey: i think you need to manually pretend you're the deployment user and log in to the hosts in order to accept the keys
[07:23:47] mobrovac: do you mean sudo -u deploy-service scap deploy?
[07:23:53] feels like cheating
[07:23:57] :D
[07:23:59] no, i mean:
[07:24:12] SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service aqs1004.eqiad.wmnet
[07:24:15] from tin
[07:24:26] trying..
[07:25:10] it tells me the same, Agent admitted failure to sign using the key.
[07:25:24] because I can't access the key holder
[07:25:48] the strange thing is that I haven't found docs about it
[07:25:58] ok, let's try to rearm the keyholder
[07:26:03] as root, run keyholder arm
[07:28:42] mobrovac: mmm not really confident to mess with tin, even if it should be a simple op. Maybe we can move to security so others will be aware?
[08:23:59] Analytics, Wikipedia-Android-App-Backlog: Investigate recent decline in views and daily users - https://phabricator.wikimedia.org/T132965#2304104 (JAllemandou) Code has been deployed the day before yesterday at 17:00 UTC. Yesterday number for enwiki mobile apps in vital signs seems aligned with before-b...
[08:24:14] Hi elukey
[08:24:30] hey joal
[08:24:46] found your way through scap, or not yet ?
[08:24:58] Marko figured out the mystery, namely https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/hieradata/common/scap/server.yaml;b893d369413b7553d77b46c6144103df5cfab6a1
[08:25:25] trusted_group for deploy-service is also aqs-admin, in which you are and I am not :)
[08:25:55] and you are not in deploy-service either?
[08:26:08] nope!
[08:26:14] I thought ops has ALL the rights ;)
[08:26:54] Alex filed a patch some days ago to add ops to the trusted groups of the keyholder by default, still wip
[08:26:57] :)
[08:27:14] one can brutally sudo of course but it doesn't solve the problem correctly
[09:02:54] elukey: Heya
[09:03:22] o/
[09:03:42] elukey: question: I'd like to transfer data from altiscale (research cluster) to eqiad (stat1004 for hdfs), about 100GB, and I'd like to prevent having all that cross the atlantic ... Any idea?
[09:06:06] netcat!
[09:06:21] but ports need to be opened of course
[09:07:35] for example, for the redis backups I created a tar.gz and piped through netcat, then opened a netcat -l on the other side (same port of course)
[09:09:50] in this case, 100GB of tarball might be a bit aggressive to create :P
[09:11:33] yeah might be needed, because the netcat on stat1004 will not be able to discriminate files
[09:11:54] it will just forward blobs of data to something, like > out.tar.gz
[09:11:58] joal: ---^
[09:12:04] http://www.microhowto.info/howto/copy_a_file_from_one_machine_to_another_using_netcat.html
[09:12:56] awesome elukey, will try that !
[09:13:17] What ports should I use from stat1004?
[09:16:05] joal: try with something > 1024 like 6666
[09:16:26] elukey: THE DEVIL'S PORTTTTT AHHHHHHHHH !
[09:17:31] hahahahaha
[09:18:04] May 18 09:10:13 aqs1004 cassandra[21918]: java.lang.RuntimeException: Unable to gossip with any seeds
[09:18:12] buuu
[09:18:22] :(
[09:18:28] * joal pats elukey on the back
[09:18:55] elukey: Actually, how will I connect to stat1004 from outside of our network ... I think that won't work :(
[09:19:26] ahhh I thought that altiscale was the name of a host :(
[09:19:35] no it won't then...
[09:19:37] :(
[09:21:09] maybe we can ask moritzm if there is any way
[09:22:05] we have 100GB of data on https://www.altiscale.com/ (research cluster) and we'd need to transfer them to stat1004 to upload it to hadoop
[09:22:13] any way that this is doable?
[09:26:59] HTTPS_PROXY=http://webproxy.eqiad.wmnet:8080 curl https://www.altiscale.com/somefile.tgz -o somefile.tgz
[09:28:48] that + python -m SimpleHTTPServer, in a pinch
[09:29:21] ahhhh wow thanks ori!
[09:29:44] joal: simpler than I thought, I told you that you don't have to pay attention to what I say :P
[09:30:08] heh, np. and now i'm really off
[09:32:30] elukey: concern is, I can't serve those files through a domain :(
[09:32:41] netcat using http proxy ?
[09:33:01] man, that sounds a bit complicated but can't think of anything better
[09:33:34] ok, need to be AFK for a while, will get back to this after
[09:40:15] joal: going AFK as well, let's chat in the batcave after lunch!
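Editor's note: the netcat idea discussed above (one side listens on a port and dumps whatever arrives into a file, the other side connects and streams a file into the socket) can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names (`receive_to_file`, `send_file`, `transfer_locally`) are made up for this note, the demo runs both ends on localhost, and it only helps when the two hosts can reach each other directly, which the chat points out is not the case between altiscale and stat1004.

```python
import socket
import threading

CHUNK = 64 * 1024  # read/write in 64 KiB blocks so a large file is never held in memory


def receive_to_file(server_sock, out_path):
    """Accept one connection and write every byte it sends to out_path.

    Like `nc -l PORT > out.tar.gz`, the receiver cannot tell files apart;
    it just stores the byte stream it is given.
    """
    conn, _addr = server_sock.accept()
    with conn, open(out_path, "wb") as out:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # empty read means the sender closed the connection
                break
            out.write(data)


def send_file(host, port, in_path):
    """Connect to host:port and stream the contents of in_path (like `nc HOST PORT < file`)."""
    with socket.create_connection((host, port)) as sock, open(in_path, "rb") as src:
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            sock.sendall(data)


def transfer_locally(in_path, out_path):
    """Demo helper (hypothetical): run both ends on localhost with an ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        receiver = threading.Thread(target=receive_to_file, args=(server, out_path))
        receiver.start()
        send_file("127.0.0.1", port, in_path)
        receiver.join()
```

On two real hosts the receiver would bind to an unprivileged port that is open between them (the chat suggests something above 1024) and the sender would call `send_file` with the receiver's hostname.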
[09:40:28] * elukey afk for a bit + lunch
[09:48:14] Analytics-Kanban, Operations, ops-eqiad, Patch-For-Review: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2304310 (fgiunchedi) the raid arrays issue might be related to {T131961} though that should be fixed already in puppet for jessie, modulo rebuild of initramfs
[10:27:24] Analytics-Kanban: Create edit data hadoop/druid schemas for anaylitcs - https://phabricator.wikimedia.org/T134793#2304355 (mforns) a:mforns
[13:07:23] * elukey is trying to bootstrap a cluster with https://wikitech.wikimedia.org/wiki/Cassandra#Bootstrap_a_brand_new_cluster
[14:44:06] I just spent two hours debugging something that was trivial
[14:44:12] namely adding -p to nodetool
[14:44:24] I am sad
[14:46:56] whups
[14:47:13] elukey: we're about ready to merge https://gerrit.wikimedia.org/r/#/c/284078/
[14:47:55] which lets you choose what Cassandra version to configure-for on a host
[14:48:11] elukey: it defaults to 2.1, and so should be a no-op
[14:48:46] elukey: but we were planning to disable puppet everywhere, and use restbase staging as a canary, just to be sure
[14:49:30] urandom: saw it, looks good!
[14:49:34] elukey: you have 3 new nodes up by now though, yes? aqs100[4-6]?
[14:49:55] elukey: just want to make sure the right things is done here
[14:50:04] s/things/thing/
[14:50:06] urandom: I am bootstrapping them with a testing aqs cluster atm, only one cassandra instance left
[14:51:00] urandom: https://gerrit.wikimedia.org/r/#/c/288373/ - I wanted to touch base with you after this test to explain
[14:51:36] with this one we should be able to test compaction etc.. loading cassandra with some data
[14:51:59] and then once we'll be ready we'll re-image and add one instance at a time to the current aqs cluster
[14:52:02] elukey: sounds good
[14:53:04] elukey: akosiaris is going to be helping me with the merge (in a couple of hours), should we disable puppet on your new hosts as well?
[14:53:23] already disabled :)
[14:54:45] elukey: kk
[15:05:30] elukey, urandom: If cassandra 2.2.6 doesn't seem to fail, I guess we could also try it from the beginning
[15:05:34] joal: cassandra cluster up! \o/
[15:06:23] * joal dances the victory dance for elukey b
[15:07:17] elukey: you let me know when you want us to start playing with it
[15:07:54] joal: not yet, but if you want to check on the hosts you can use nodetool-a status
[15:08:08] ok cool
[15:08:09] :)
[15:08:14] * joal is impatient
[15:08:32] I also found https://gerrit.wikimedia.org/r/#/c/289424/1
[15:08:36] typo :D
[15:09:25] elukey: 1 on 1?
[15:09:34] elukey: not sure what is up with reminder for that meeting
[15:10:42] nuria_: Joining, really sorry, didn't get the memo :(
[15:10:49] jaja
[15:11:03] elukey: np
[15:36:53] joal: https://ganglia.wikimedia.org/latest/?c=Analytics%20Query%20Service%20eqiad&m=mem_report&r=week&s=by%20name&hc=4&mc=2
[15:39:18] Analytics-Kanban, Patch-For-Review: Test cassandra compactions on new AQS nodes - https://phabricator.wikimedia.org/T135145#2305439 (elukey) Cluster up and running! ``` elukey@aqs1006:~$ nodetool-a status Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Addre...
[15:39:35] urandom: 3-4140-b59c-66fbdc16af6a rack1
[15:39:45] argh sorry https://phabricator.wikimedia.org/T135145
[15:42:02] joal: the cluster is up and I also ran puppet, all good
[15:42:31] awesome elukey
[15:45:08] joal: any luck with the data transfer problem?
[15:45:25] took a different approach :)
[16:05:35] elukey: nice!
[16:15:55] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2305631 (Nuria) @GWicke: Per our conversation on irc "nuria_: it's very unlikely to trigger, tbh as discussed with @milimetric, the limits are enforced per worker, and are set relatively high" These workers ar...
[16:22:41] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2305661 (Nuria) @GWicke : can you explain a bit what "worker" is on this context?
[16:36:44] mforns, milimetric : https://plus.google.com/hangouts/_/wikimedia.org/a-batcave2
[16:37:00] joal, we're there... no?
[16:37:10] batcave-2
[16:37:13] a-batcave-2
[16:37:14] Arfff
[17:13:21] elukey: https://phabricator.wikimedia.org/T95253
[17:13:40] elukey: that needs to be addressed before you move these nodes to production
[17:14:18] elukey: i'd recommend upgrading the existing nodes to 2.1.13, and installing 2.1.13 on the new nodes when you reimage
[17:16:24] urandom: got it, will read and upgrade
[17:16:30] should be a puppet change right?
[17:17:15] no, at some point, we didn't get aqs on 2.1.13 when we rolled that out elsewhere, and in the meantime someone upgraded the apt repo to 2.1.14
[17:18:20] elukey: so we should get the apt repo downgraded to 2.1.13, and upgrade your existing nodes before bootstrapping any of the new ones into that cluster
[17:19:26] elukey: so to answer your question more directly, it's a package upgrade (no puppet needed)
[17:20:49] urandom: okok makes sense
[17:27:40] urandom: so the current installed version $everywhere is 2.1.13, but on apt there is 2.1.14?
[17:57:28] * elukey going afk, will double check tomorrow!
[17:57:33] byeeeeee
[17:59:07] Analytics-General-or-Unknown, Language-Engineering, Mobile-Apps, Wikipedia-Android-App-Backlog, and 2 others: there should be a comparison of clicks count on interlanguage links on different platforms - https://phabricator.wikimedia.org/T78351#843122 (JMinor) One caveat in comparing across apps a...
[18:00:06] ottomata: still nothing from the logs?
[18:00:56] mmm where are the logs?
[18:01:53] elukey: the data or the log files?
[18:01:59] log files!
[18:02:00] log files are in /var/log/kafka, data in /var/spool/kafka/*
[18:02:25] yeah still no change :/
[18:02:45] ah it is server.log
[18:02:50] I was checking kafka.log
[18:03:36] :(
[18:03:50] will check later, really weird! Ping me if you need me ottomata
[18:03:52] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2306183 (GWicke) @nuria, as described [in the documentation](https://github.com/wikimedia/service-runner#rate-limiting), the default backend is a simple in-memory backend. This enforces request limits per servi...
[18:21:34] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2306275 (Nuria) @Gwicke: the documentation on this regard is real meager, seems to me that to understand how it works you need prior knowledge about the inners of the service, in this case from your comment I u...
[18:23:21] yeah server.log was the old one, now the loggers have individual log files
[18:25:05] elukey: are your changes affecting aqs production? there are some cassandra alarms about timeouts
[18:49:01] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2306357 (GWicke) @Nuria, I walked through all of this with @Milimetric, and also gave you a link to an example config in T135240#2302880. My recollection from that conversation is that @Milimetric was planning...
[18:50:16] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2306362 (Nuria) @Gwicke: we already sync-ed up on this and both @Milimetric and myself agreed that we want limits for the "entire" api
[18:50:53] (PS4) Nuria: Initial content of analytics.wikimedia.org [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289062 (https://phabricator.wikimedia.org/T134506)
[18:51:56] (CR) Nuria: "Please take a look and let me know if UI wise you think we are Ok to have this as our 1st deploy" [analytics/analytics.wikimedia.org] - https://gerrit.wikimedia.org/r/289062 (https://phabricator.wikimedia.org/T134506) (owner: Nuria)
[19:11:21] moving to cafe back shortly
[19:20:36] nuria_: 500 in aqs were due to request spike, as usual
[19:45:47] Analytics-Kanban: Enable rate limiting on pageview api - https://phabricator.wikimedia.org/T135240#2306644 (Milimetric) What I (mis)understood from the IRC conversation we had was that the "basic kademlia configuration" as described in the documentation was already enabled on the RESTBase cluster that uses t...
[19:50:30] joal: let's see if I can get clarity on the throttling
[19:58:46] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 2 others: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2306752 (Milimetric) @kaldari that bug only affected beta and it was only for a couple of days, it's...
[20:16:42] nuria_: Hi!
[20:16:48] hola!
[20:20:27] nuria_: the new cassandra cluster is not connected to the prod one, as joal was saying the usual timeouts.. 
:(
[20:20:49] elukey: right, just wanted to triple check those alarms were coming from prod
[20:21:07] yep and better double check, thanks for the ping :)
[20:31:18] raaaats elukey no change huh
[20:31:19] :(
[20:35:36] going to do the broker bounce dance
[20:35:55] elukey: i'm going to set a short retention time on a broker, then bounce it, then wait for it to come back up and delete logs, and then reset it back to 7 days, and bounce again
[20:36:06] i will do this on kafka1013 first, since it is not currently a leader
[20:37:34] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306885 (Ottomata) Grr, these are getting close to full. Luca and I tried to dynamically set topic retention, but kafka d...
[20:41:23] hmmmm, wait a minute, it is possible my previous retention.ms was a bad setting, going to try it again
[20:41:24] first
[20:45:13] yay it is deleting!
[20:46:12] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2306971 (Ottomata) I take it back! The command I had run previously looks like it had a larger retention.ms than the defa...
[21:10:02] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2307079 (Ottomata) Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config ove...
[22:06:11] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 2 others: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2307316 (Krenair) >>! In T115119#2306752, @Milimetric wrote: > Also, I noticed that searching for "Ex...
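Editor's note: the retention mix-up mentioned above (a `retention.ms` override that turned out to be larger than the default, so nothing was deleted) comes down to millisecond arithmetic. A minimal sketch, assuming the 7-day default and the 48-hour target quoted in the chat; the helper names `retention_ms` and `shortens_retention` are hypothetical and not part of any Kafka tooling:

```python
from datetime import timedelta


def retention_ms(**duration):
    """Convert a duration given as timedelta keyword arguments (days=, hours=, ...) to milliseconds."""
    return int(timedelta(**duration).total_seconds() * 1000)


SEVEN_DAYS_MS = retention_ms(days=7)           # 604800000, the "7 days" the chat resets to
FORTY_EIGHT_HOURS_MS = retention_ms(hours=48)  # 172800000, the temporary 48-hour window


def shortens_retention(override_ms, default_ms=SEVEN_DAYS_MS):
    """Return True only if the override would make brokers delete data sooner than the default."""
    return override_ms < default_ms
```

Checking a candidate value with `shortens_retention` before applying it would flag an override that is accidentally larger than the default.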
[23:07:15] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 2 others: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2307475 (kaldari)
[23:10:37] Analytics, MediaWiki-extensions-WikimediaEvents, The-Wikipedia-Library, Wikimedia-General-or-Unknown, and 2 others: Implement Schema:ExternalLinkChange - https://phabricator.wikimedia.org/T115119#2307479 (kaldari) @Sadads: It looks like there's a config variable for turning this particular event-...