[07:35:29] Analytics, Commons, Multimedia, Tabular-Data, and 3 others: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...) - https://phabricator.wikimedia.org/T120452#2755777 (Yurik)
[09:59:33] Analytics, Dumps-Generation, Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2755917 (Bawolff) New theory. From what I understand, the analytics cluster determines page name by trying to parse the path of...
[10:17:30] Analytics, Dumps-Generation, Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2755950 (Bawolff) So assuming that's correct, this leaves the question of should we do anything about it? From a security perspec...
[11:00:16] joal: o/
[11:00:38] not sure if you are working today buuut I am restarting Cassandra on AQS for OpenJDK upgrades
[11:08:51] Analytics, Pageviews-API: Non existing article is one of the most viewed according to the data returned by the /metrics/pageviews/top/ API - https://phabricator.wikimedia.org/T149178#2756054 (mobrovac)
[11:41:01] Hi elukey !
[11:41:11] I'm here today :)
[11:42:36] Please restart when you want :)
[12:04:54] joal: done :)
[12:05:35] this time no 50x, system_auth didn't interfere :)
[12:05:50] elukey: :)
[12:06:28] * elukey lunch!
[13:44:34] Hey elukey
[13:47:02] o/
[13:50:37] Would you mind merging a CR for me to deploy (still making a last change)
[13:50:44] elukey: --^
[13:52:08] I can review it but you can merge it without issues :)
[13:52:10] (PS3) Joal: Update cassandra load jobs to match new aqs needs [analytics/refinery] - https://gerrit.wikimedia.org/r/315241 (https://phabricator.wikimedia.org/T147841)
[13:52:15] elukey: --^
[13:52:41] joal: I can review it but you should be the one +2ing it no?
[13:54:44] (CR) Elukey: [C: 1] Update cassandra load jobs to match new aqs needs [analytics/refinery] - https://gerrit.wikimedia.org/r/315241 (https://phabricator.wikimedia.org/T147841) (owner: Joal)
[13:54:52] LGTM :)
[13:57:51] Ok thanks elukey
[13:57:58] merging, deploying, restarting :)
[13:58:33] (CR) Joal: [C: 2 V: 2] "Reviewed, merging for deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/315241 (https://phabricator.wikimedia.org/T147841) (owner: Joal)
[14:00:28] !log deploy refinery
[14:00:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:09:04] elukey: I have messed up the deploy after a failed connection :(
[14:09:54] elukey: on tin, I didn't manage to deploy fully after canary (ssh connection locked)
[14:10:18] elukey: Now it tells me deploy is locked
[14:10:25] this is new :)
[14:10:28] checking
[14:10:50] elukey: do I have a way to have it continue deploy instead of restart?
[14:10:54] thanks mate
[14:11:51] fyi, i'm adding kafka1003 as a broker into main-eqiad
[14:11:56] not adding replicas yet
[14:11:57] ottomata: o/
[14:11:59] hiyaa
[14:12:14] so /srv/deployment/analytics/refinery/scap/deploy.lock is there, maybe deleting it will make everything work
[14:12:19] but I have never done it
[14:12:26] there could be a clean way to recover
[14:12:31] ottomata: --^ ?
[14:12:51] !log adding kafka1003 as kafka broker in main-eqiad cluster
[14:12:56] so refinery's deployment is blocked because the .lock is held by a previous session started by joal (connection broken)
[14:12:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:13:00] elukey, ottomata : maybe forcing a new deploy ?
[14:13:21] hm, ah, i'm not sure! sorry! just merged kafka thing, with you shortly...
[14:13:51] sure
[14:14:29] elukey: trying to force?
[14:14:43] joal: mmm I don't think it will work
[14:14:54] we should remove the lock, I am almost sure
[14:15:10] elukey: ok, I trust you
[14:15:23] elukey: I don't think I have the rights, can you go for it ?
[14:17:23] I am sanity checking the whole thing, because on the canary we have already flipped the new config but not on the targets
[14:17:33] I am wondering whether this can cause issues
[14:17:35] elukey: right
[14:17:43] but the only one I can think of is in case of rollback
[14:17:45] elukey: I'm very sorry
[14:18:00] joal: don't be, it happens!
[14:20:21] mmm joal the file is not there anymore
[14:20:25] can you retry deploying?
[14:20:57] sorry sorry a sec :)
[14:21:20] yeah it is not there anymore
[14:21:36] joal: let's try to deploy again
[14:24:14] k elukey
[14:24:16] trying
[14:25:49] elukey: deploy successful, but it didn't write me machines
[14:26:15] ??
[14:26:21] what do you mean?
[14:27:18] it seems it has deployed in stat1002 but not others
[14:27:21] elukey:
[14:27:38] ah yes I can see in deploy log
[14:27:48] so basically it stops at the canary since it is up to date
[14:28:02] we need the --force
[14:28:08] ok, will do
[14:28:51] elukey: Thanks ! It's now deploying onto other nodes
[14:29:06] \o/
[14:29:20] thanks again :)
[14:31:20] :)
[14:49:42] !log adding kafka1003 in as replicas for active main-eqiad topics
[14:49:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:52:29] Analytics-EventLogging, Patch-For-Review: Provide a robust way of logging events without blocking until network request completes; use sendBeacon - https://phabricator.wikimedia.org/T44815#2756765 (Nuria) Closing as changes are final for sendBeacon
[14:52:40] Analytics, MediaWiki-extensions-MultimediaViewer, Multimedia: Image link clicks might not get logged - https://phabricator.wikimedia.org/T58426#2756769 (Nuria)
[14:52:42] Analytics-EventLogging, Patch-For-Review: Provide a robust way of logging events without blocking until network request completes; use sendBeacon - https://phabricator.wikimedia.org/T44815#2756768 (Nuria) Open>Resolved
[14:53:29] Analytics, Dumps-Generation, Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2756788 (Nuria) @Bawoff: right, this should be a 400 in mw end. We have driven similar changes recently so we return 404s from mw...
[14:59:51] Analytics, Dumps-Generation, Security: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100#2756804 (Bawolff) Arguably from a mw perspective this is a good and valid request - curid parameter takes precedence over title p...
[15:21:18] nuria: how do you want us to proceed with deployment?
[15:21:32] nuria: you want to try it, or shall I do it?
[15:21:45] joal: let me merge the java change and we can deploy together after the debrief?
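Editor's sketch of the stale-lock recovery discussed between 14:09 and 14:28: check whether the lock file is still present, then decide on the next step. The lock path is the one quoted at 14:12:14. The helper names `report_lock` and `next_step` are hypothetical, and the sketch assumes the `--force` mentioned at 14:28:02 is an option of `scap deploy`. It only prints a suggestion and never deletes the lock or starts a deploy.

```shell
#!/usr/bin/env bash
# Dry-run sketch: inspects a scap lock file and prints a suggested next step.
# Nothing is deleted and no deploy is started.

# Lock path quoted in the log at 14:12:14.
LOCK_FILE="/srv/deployment/analytics/refinery/scap/deploy.lock"

# report_lock PATH (hypothetical helper)
# Prints "locked" if PATH exists, otherwise "free".
report_lock() {
  if [ -e "$1" ]; then
    echo "locked"
  else
    echo "free"
  fi
}

# next_step STATE (hypothetical helper)
# Maps the state to the action taken in the log: once the lock was gone,
# the deploy had to be rerun with --force because the canary was already
# up to date (assumption: --force is passed to `scap deploy`).
next_step() {
  case "$1" in
    locked) echo "lock still held: find the stale session before retrying" ;;
    free)   echo "scap deploy --force" ;;
    *)      echo "unknown state: $1" >&2; return 2 ;;
  esac
}

state=$(report_lock "$LOCK_FILE")
next_step "$state"
```

Printing the command instead of running it keeps the sketch safe to try on any host; on the deployment server the printed line would still need to be reviewed and run by hand.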
[15:21:54] nuria: sure :)
[15:22:31] (CR) Nuria: [C: 2] Enhancing regex to support pageviews to non-knowledge wikis [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) (owner: Nuria)
[15:22:53] joal: I have not deployed before so i would like to do it with a bit of oversight
[15:23:21] nuria: no problem, it's well documented and should be easy enough
[15:25:36] (Merged) jenkins-bot: Enhancing regex to support pageviews to non-knowledge wikis [analytics/refinery/source] - https://gerrit.wikimedia.org/r/316845 (https://phabricator.wikimedia.org/T130249) (owner: Nuria)
[15:29:50] a-team sorry, brt, gotta use br
[15:33:14] Analytics, Analytics-Kanban, Pageviews-API: Non existing article is one of the most viewed according to the data returned by the /metrics/pageviews/top/ API - https://phabricator.wikimedia.org/T149178#2756982 (Nuria)
[15:40:29] Analytics, Easy: Fix layout of the daily email that sends pageview dataset status - https://phabricator.wikimedia.org/T116578#2757054 (Nuria)
[15:45:16] Analytics: Fix the pageview API "top" spec and 404 reporting {slug} - https://phabricator.wikimedia.org/T117018#1764677 (Nuria) We can remove trailing backslashes on aqs, I think the 404 issues are taken care of.
[15:52:24] Analytics, Analytics-Dashiki: Add pivot parameter to tabular layout graphs {lama} - https://phabricator.wikimedia.org/T126279#2757099 (Nuria)
[16:00:19] sorry elukey, wrong button
[16:01:13] :P
[16:04:33] a-team: connection issues, trying to join
[16:04:51] appear.in if it doesn't work?
[16:05:56] joal: you can send us a phone number
[16:06:56] sure nuria : +33612620387
[16:07:52] we cannot call you, intnl calls cost $$$ turns out
[16:08:42] I don't get it .. worked fine minutes ago :(
[16:09:30] nuria: is there a number I can call?
[16:09:43] joal: no, there is not
[16:09:49] joal: is it your connection?
[16:10:08] it is
[16:10:47] can you tether?
[16:10:52] trying and trying, but no success :(
[16:28:58] ok, nuria, ready ?
[16:29:07] yes, batcave?
[16:29:24] nuria: if my connection allows ;)
[16:29:26] trying
[16:29:29] k
[16:37:29] (PS1) Nuria: Changes for refinery source v.0.0.36 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/318958
[16:39:14] (PS2) Nuria: Changes for refinery source v.0.0.36 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/318958
[16:39:45] (PS3) Nuria: Changes for refinery source v 0.0.36 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/318958
[16:40:27] (CR) Joal: [C: 2 V: 2] "Good" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/318958 (owner: Nuria)
[16:52:51] So nuria, when jenkins has finished, there is another jenkins job to launch, to automagically pull the new archiva jars into refinery
[16:53:36] joal: the one documented, yes
[16:53:40] joal: will do
[16:53:44] cool
[16:54:08] nuria: after that; scap deploy of refinery: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Refinery
[16:54:23] And then you should be all set to restart the pageview job :)
[16:54:52] joal
[16:54:56] Analytics, EventBus: Delete stale topics from main Kafka clusters - https://phabricator.wikimedia.org/T149594#2757362 (Ottomata)
[16:55:06] joal: ok!
[17:00:22] !log kafka preferred replica election on main-eqiad kafka cluster to promote kafka1003 as leader for its preferred partitions
[17:00:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:09:08] !log bouncing eventlogging
[17:09:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:10:44] nuria: I have checked archiva, your jars are there
[17:11:04] joal: k, in a meeting, can talk in a bit
[17:15:32] Analytics-Kanban, EventBus, Operations, Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2757461 (Ottomata)
[17:16:36] Analytics-Kanban, EventBus, Operations, Patch-For-Review: setup/install/deploy kafka1003 (WMF4723) - https://phabricator.wikimedia.org/T148849#2734746 (Ottomata) Looking good! https://config-master.wikimedia.org/conftool/eqiad/eventbus
[17:23:44] disconnecting for now, I'll be back in less than an hour
[17:23:49] nuria: --^
[17:24:27] Analytics, Analytics-Cluster, Operations, Research-and-Data, and 2 others: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#2757492 (ellery) [[ http://www.geforce.com/hardware/10series/titan-x-pascal | This ]] is the GPU we would like to order.
[17:36:22] * elukey afk!
[18:17:13] nuria: Back !
[18:18:40] nuria: from what I have seen, jars are updated in refinery, but refinery is not yet deployed
[18:38:57] joal: right out of meetings, continuing now
[18:40:43] ok cool nuria :)
[18:40:53] Let me know if you need anything
[18:40:55] ottomata: i do not have permission to ssh into tin?
[18:45:40] nuria: i think tin is gone or moved
[18:45:41] try
[18:45:49] deployment1001.eqiad.wmnet
[18:45:56] hm
[18:46:03] no
[18:46:03] just
[18:46:11] deployment.eqiad.wmnet
[18:46:11] actually
[18:46:11] it is still tin
[18:46:15] but it might have been reinstalled, so you might have to clear your host ssh key
[18:47:33] ottomata: k, indeed, thank you
[18:53:42] tin was reimaged (and mira was temporarily the deployment server, but it switched back)
[18:55:37] Analytics-Kanban, Wikipedia-iOS-App-Backlog, iOS-app-feature-Analytics: Drop in iOS app pageviews since version 5.2.0 - https://phabricator.wikimedia.org/T148663#2757922 (JMinor)
[19:12:59] nuria: going for dinner, will check some jobs when getting back
[19:14:33] nuria: looks like you have deployed everything :)
[19:14:45] joal: ya, but i did not restart jobs
[19:15:02] ok
[19:15:22] joal: can i do that via hue?
[19:15:25] nuria: we usually try to log deploy actions in the chan
[19:15:51] nuria: you can kill the previous with hue, but I've never started a new job using hue
[19:16:09] I think it's not feasible for prod jobs since they need to be launched by the hdfs user
[19:17:59] nuria: command example :
[19:18:46] sudo -u hdfs oozie job --oozie $OOZIE_URL -Drefinery_directory=hdfs://analytics-hadoop$(hdfs dfs -ls -d /wmf/refinery/2016* | tail -n 1 | awk '{print $NF}') -Dqueue_name=production -Doozie_launcher_queue_name=production -Doozie_launcher_memory=256 -Dstart_time=2016-10-31T17:00Z -config /srv/deployment/analytics/refinery/oozie/pageview/hourly/coordinator.properties -run
[19:19:15] now out for dinner ;)
[19:19:59] ottomata: I restarted varnishkafka on cp3045 and cp2018
[19:20:17] Icinga reported some errors, and it seems librdkafka related
[19:20:32] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&from=now-3h&to=now
[19:21:01] super weird, not sure what happened in there
[19:21:18] can see tons of varnishkafka[2395]: KAFKADR: Kafka message delivery error: Local: Message timed out
[19:21:22] joal: k
[19:21:38] and also errors to kafka1018
[19:22:16] hmm weird ok
[19:22:41] in the graph you can isolate them to see the impact.. cp2018 was mild, but 3045 was a bit big
[19:23:18] that was delivery callback error related
[19:23:53] (cp3045 is upload, cp2018 misc, now I explained the difference :)
[19:33:27] Analytics, Beta-Cluster-Infrastructure, WikimediaPageViewInfo: Deploy WikimediaPageViewInfo extension to beta cluster - https://phabricator.wikimedia.org/T129602#2758048 (Nuria) @greg:(sorry I did not see this ticket until today) There are two topics here: 1) Setting up an analytics stack in beta 2)...
[19:35:23] ah
[19:35:23] h
[19:35:24] hm
[19:37:11] Analytics, Beta-Cluster-Infrastructure, WikimediaPageViewInfo: Deploy WikimediaPageViewInfo extension to beta cluster - https://phabricator.wikimedia.org/T129602#2758064 (Jdforrester-WMF) >>! In T129602#2758048, @Nuria wrote: > Now, if this extension needs to be deployed ASAP then deploy to phase 0 w...
[19:51:03] ottomata: how can I kill the pageview hourly job via oozie?
[19:51:07] sorry via hue
[19:52:15] ottomata: this one? https://hue.wikimedia.org/oozie/list_oozie_coordinator/0024746-160420145651441-oozie-oozi-C/
[19:54:06] nuria: you want to stop the coordinator?
[19:54:10] you can pause if you want?
[19:54:15] or do you want to just kill it?
[19:54:16] ottomata: I want to restart jobs with new deploy
[19:54:19] that looks like the right coordinator
[19:54:20] ok yeah
[19:54:28] ottomata: so i think kill no?
[19:54:30] you can click kill on the left hand side
[19:54:30] yeah
[19:54:32] that would do it
[19:55:01] ottomata: man, where? I see no buttons ...
[19:55:17] left hand side
[19:55:30] under Manage
[19:55:42] you can also do it via CLI
[19:55:45] oozie job -kill 0024746-160420145651441-oozie-oozi-C
[19:55:49] ah, ok
[19:55:53] i will do just that
[19:56:12] i forgot
[19:56:30] ok
[19:56:35] ottomata: soorryyy
[19:56:42] i should have remembered
[19:57:18] ottomata: and .. do i need to restart or will it pick up latest jars next time it runs if i do not restart it?
[19:57:38] you need to resubmit it
[19:57:42] as a new coord job
[19:57:47] if you want it to pick up new jars
[19:57:57] since we use static paths to deployments in hdfs when starting jobs
[19:58:35] k
[19:58:38] got it
[20:36:19] ottomata: I have pinged chris on the creation of a repo for the node module. hopefully i can move code tomorrow
[20:36:58] I'll create the repo today. But it'll be a few hours before I get to it.
[20:37:17] qchris_ is always WATCHING!
[20:37:23] :-P
[20:37:30] qchris_: ya. no rush.
[20:37:36] qchris_: thank you.
[20:45:04] ok cool
[20:45:05] haha :)
[20:59:29] nuria: just checked, looks like the pageview job is gently running :)
[21:00:00] nuria: Thanks for having deployed :)
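
Editor's sketch of the kill-and-resubmit sequence described at 19:17-19:18 and 19:55-19:57: because a coordinator is started with a static HDFS path to one refinery deployment, picking up new jars means killing the old coordinator and submitting a new one that points at the newest deployment directory. The coordinator ID, properties path and option values are the ones quoted in the log. The function names are hypothetical, and the script only prints the commands so they can be reviewed before anyone runs them.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the oozie commands from the log instead of running them.

# Values quoted in the log.
COORD_ID="0024746-160420145651441-oozie-oozi-C"
PROPERTIES="/srv/deployment/analytics/refinery/oozie/pageview/hourly/coordinator.properties"

# latest_refinery_dir (hypothetical helper)
# Reads `hdfs dfs -ls -d` style output on stdin and prints the path of the
# last entry, mirroring the `tail -n 1 | awk '{print $NF}'` step in the log.
latest_refinery_dir() {
  tail -n 1 | awk '{print $NF}'
}

# kill_cmd COORDINATOR_ID (hypothetical helper)
# Prints the command that kills the old coordinator.
kill_cmd() {
  printf 'oozie job -kill %s\n' "$1"
}

# resubmit_cmd REFINERY_DIR START_TIME PROPERTIES_FILE (hypothetical helper)
# Prints the command that submits a new coordinator against REFINERY_DIR.
# $OOZIE_URL is left unexpanded on purpose so the printed line matches the log.
resubmit_cmd() {
  printf 'sudo -u hdfs oozie job --oozie $OOZIE_URL -Drefinery_directory=hdfs://analytics-hadoop%s -Dqueue_name=production -Doozie_launcher_queue_name=production -Doozie_launcher_memory=256 -Dstart_time=%s -config %s -run\n' "$1" "$2" "$3"
}

# Show the kill command for the coordinator discussed in the log.
kill_cmd "$COORD_ID"
```

On a host with HDFS access, the directory would come from `hdfs dfs -ls -d /wmf/refinery/2016* | latest_refinery_dir`, and its output would be passed to `resubmit_cmd` together with the start time and `$PROPERTIES`. The start time has to be chosen for each restart; the 2016-10-31T17:00Z value in the log applied to that day only.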