[01:31:54] (PS1) GoranSMilovanovic: info [analytics/wmde/WDCM-WikipediaSemantics-Dashboard] - https://gerrit.wikimedia.org/r/474617 [01:31:56] (CR) GoranSMilovanovic: [V: 2 C: 2] info [analytics/wmde/WDCM-WikipediaSemantics-Dashboard] - https://gerrit.wikimedia.org/r/474617 (owner: GoranSMilovanovic) [02:39:36] Analytics, Wikipedia-iOS-App-Backlog, iOS-app-Bugs, iOS-app-feature-Analytics, iOS-app-v6.2-Beluga-On-A-Pogo-Stick: Many errors on "MobileWikiAppiOSSearch" and "MobileWikiAppiOSUserHistory" - https://phabricator.wikimedia.org/T207424 (chelsyx) Open>Resolved v6.1.1 was released on... [02:44:35] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is CRITICAL: CRITICAL [02:54:39] I just had an error from JupyterHub on notebook1004 that said that the "database or disk is full" [04:25:39] There were also error msgs on #wikimedia-operations earlier today, including "PROBLEM - Disk space on notebook1004 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=83%)" [05:59:04] ah [06:00:34] I freed about ~2GB of data files I didn't need any more. [06:03:58] not sure if what I did makes a difference [06:04:44] df says there should be 29G free in / [07:13:00] Analytics, Analytics-Kanban: Upgrade Matomo to 3.6.1 or 3.7.0 - https://phabricator.wikimedia.org/T209808 (elukey) p:Triage>Normal [07:13:34] Analytics, Analytics-Kanban: Upgrade Matomo to 3.6.1 or 3.7.0 - https://phabricator.wikimedia.org/T209808 (elukey) [07:13:41] Analytics, Analytics-Kanban: Upgrade Matomo to 3.6.1 or 3.7.0 - https://phabricator.wikimedia.org/T209808 (elukey) [07:29:41] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1080.eqiad.wmnet'... 
[07:45:38] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on notebook1004 is OK: OK [08:01:21] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-96].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1080.eqiad.wmnet'] ` and were **ALL** successful. [08:09:39] Morning elukey [08:09:44] Bonjour! [08:09:55] Thanks for taking care of Yarn this weekend :) [08:10:32] super weird, and also 99% caused by me :D [08:10:56] Your explanation email was super clear :) [08:11:15] elukey: any idea about the notebook1004 disk-issue ? [08:11:51] which one? :D [08:11:56] Mwhahaha :) [08:12:14] jokes aside, the mnt/hdfs mountpoint is sadly very flaky [08:12:28] and I had to force the remount several times [08:12:45] early this morning people were complaining of disk-space (just reading the backlog) [08:13:26] I checked before and all is good [08:13:27] /dev/mapper/notebook1004--vg-data 136G 74G 56G 58% /srv [08:13:43] the main issue is that without any restriction people tend to overuse resources :D [08:13:57] of course [08:15:28] I am currently finishing the DHCP config for the new hadoop worker [08:15:35] after that we should be able to reimage [08:15:41] then prep the datanode partitions [08:15:48] \o/ ! Moar workers :) [08:15:49] and hopefully be ready to expand the cluster :) [08:15:58] Many thanks for that :) [08:17:28] also I am really worried about the banner impression stuff [08:17:39] they already announced the peak days [08:17:52] and no answer in the task yet :( [08:17:57] (already pinged them two times) [08:17:57] right :( [08:18:16] I'm pretty sure it'll backfire when they are under pressure :( [08:18:49] exactly :( [08:19:00] elukey + joal: Hey both, i'm really sorry we haven't been able to get back to you. We have been snowed under with security tasks the last few weeks [08:19:39] Seddon: hi! 
[08:19:50] Hi Seddon :) [08:20:35] thanks for the feedback, my main concern is that a lot of work (from our side) is still needed and with short notice it might be difficult to deliver in time :( [08:28:05] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) [08:37:48] @joal + @elukey: We still need to do some clean up after a rushed last-minute Friday night SWAT deployment but I'll touch base with dstrine and andyrussg later today their time and see if we can get back to you today SF time. I've caught up on the task and I suspect I know the answers but don't wanna preclude any decision from fr-tech [08:38:24] Sounds great Seddon - Thanks for stepping up on this :) [08:39:40] thanks! [08:41:25] * Seddon heads back to the coalface [08:42:44] Good luck Seddon - Beware of firedamps [08:58:01] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1081.eqiad.wmnet'... [09:16:21] Analytics, EventBus, Operations, WMF-JobQueue, and 4 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220 (jijiki) @Pchelolo/@mobrovac jobqueue_redis instances have been removed from prod and we have cleaned up any puppet and mediawiki-config references . Should we... [09:57:09] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1082.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1082.... 
[10:02:31] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1082.eqiad.wmnet'... [10:15:51] Analytics, Anti-Harassment: Add ipblocks_restrictions table to Data Lake - https://phabricator.wikimedia.org/T209549 (Marostegui) >>! In T209549#4755134, @TBolliger wrote: > From some discussions, I believe we'll want to add `ipblocks_restrictions` to `filtered_tables.txt` so my team can compare monthly... [10:24:15] Analytics, Anti-Harassment: Add ipblocks_restrictions table to Data Lake - https://phabricator.wikimedia.org/T209549 (dmaza) >>! In T209549#4757436, @Marostegui wrote: > > I don't really understand what `filtered_tables.txt` you guys expect to get from adding the table there. This file is only used to g... [10:25:34] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1082.eqiad.wmnet'] ` and were **ALL** successful. [10:27:32] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1083.eqiad.wmnet'... [10:33:27] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 3 others: [EPIC] Develop a JobQueue backend based on EventBus - https://phabricator.wikimedia.org/T157088 (mobrovac) [10:33:34] Analytics, EventBus, Operations, WMF-JobQueue, and 3 others: Stop and remove old job runners - https://phabricator.wikimedia.org/T198220 (mobrovac) Open>Resolved Indeed @jijiki ! Thanks! 
[10:38:07] Analytics, Anti-Harassment: Add ipblocks_restrictions table to Data Lake - https://phabricator.wikimedia.org/T209549 (Marostegui) >>! In T209549#4757445, @dmaza wrote: >>>! In T209549#4757436, @Marostegui wrote: >> >> I don't really understand what `filtered_tables.txt` you guys expect to get from addin... [10:44:34] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (elukey) @Cmjohnson the Debian OS install is in progress, but I think that an-worker109[45] have their network ports disabled. Can you... [11:00:34] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1084.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1084.... [11:13:27] Analytics, Anti-Harassment: Add ipblocks_restrictions table to Data Lake - https://phabricator.wikimedia.org/T209549 (dmaza) >>! In T209549#4757473, @Marostegui wrote: > That is not correct. > Tables do not get exposed by default to prevent accidental data leaks. > Note that "exposed" is different from "... [11:28:22] Analytics: Add new wikis to analytics - https://phabricator.wikimedia.org/T209822 (Urbanecm) [11:28:38] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1085.eqiad.wmnet'... [11:46:13] hey joal! permission to deploy aqs to production and add the two fields to the uniques keyspace? 
[11:47:14] fdans: fine to me, but please !log the change (and task ref) in #wikimedia-operations' SAL [11:47:26] and in here [11:47:29] so it is traceable [11:47:29] elukey: always! [11:47:39] then I think you can go :) [11:48:02] we could leave the code running for a sec on aqs1004 (IIRC the canary) and see if something horrible is returned [11:48:05] and rollback in case [11:48:28] about rollback: say that we realize that something horrible is happening, can we easily rollback the schema change? [11:48:51] (paranoid me asking, nothing more) [12:05:05] fdans: do you want to do it now or later? If later I'll go to lunch :) [12:05:35] wooo sorry luca I was installing docker [12:05:54] but we can do it later, I have to make lunch too [12:06:58] ack then :) [12:28:56] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1086.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1086.... [12:56:45] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1087.eqiad.wmnet'... [13:01:30] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1087.eqiad.wmnet'] ` and were **ALL** successful. [13:07:31] elukey: I'm back, whenever you want we can do this :) [13:07:57] I am as well [13:08:18] before proceeding - is rollback easy if needed? 
[13:09:08] elukey: we would remove the fields from the schema and bump up the version [13:10:08] all right [13:10:12] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1088.eqiad.wmnet'... [13:10:29] so first let's deploy to the canary (scap should do it and ask for confirmation before proceeding further) [13:10:39] then do all the appropriate checks [13:10:46] and proceed if we feel comfortable [13:10:48] elukey: ack, doing the docker thing right now [13:17:14] (CR) Fdans: [V: 2 C: 2] Add offset and underestimate to uniques table schema [analytics/aqs] - https://gerrit.wikimedia.org/r/473708 (https://phabricator.wikimedia.org/T167539) (owner: Fdans) [13:19:02] Analytics, Analytics-Data-Quality, Product-Analytics: mediawiki_history datasets have null user_text for IP edits - https://phabricator.wikimedia.org/T206883 (JAllemandou) I hear your point and it makes a lot of sense. I think our views differ in the notion of //current name//. In my world a current... [13:21:32] Analytics, Analytics-Kanban, Product-Analytics: Hive query fails with local join - https://phabricator.wikimedia.org/T209536 (JAllemandou) >>! In T209536#4754461, @Neil_P._Quinn_WMF wrote: > Hmm, the big uptick in errors I described above happened using HiveServer2. Do you mean that this happens on b... 
[13:24:30] (PS1) Fdans: Update aqs to 05d72cf [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/474686 [13:25:40] ready to deploy with your permission elukey ^ [13:25:45] Analytics, Dumps-Generation, ORES, Scoring-platform-team, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (JAllemandou) When data gets stored in Hadoop, it is easy to supply pageviews-like dumps files. About how to compute the sc... [13:26:32] fdans: is https://gerrit.wikimedia.org/r/#/c/analytics/aqs/deploy/+/474686/1/src ok? [13:26:44] I mean, did you check that the sha is the one needed? [13:26:51] if so, feel free to go :) [13:27:14] elukey: yessir :) [13:27:24] please go then :) [13:27:36] remember to !log the schema change (linking the task) [13:27:54] !log deploying aqs to add new fields to uniques dataset (T167539) [13:27:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:27:58] T167539: Final steps to expose project family unique devices data - https://phabricator.wikimedia.org/T167539 [13:29:12] fdans: /me --pedantic (sorry) - next time I think it would be better to !log the schema change (if you are doing anything manually to cassandra) and then add a meaningful deploy message to scap [13:29:43] Analytics, ORES, Scoring-platform-team: Choose HDFS paths and partitioning for ORES scores - https://phabricator.wikimedia.org/T209731 (JAllemandou) >>! In T209731#4754979, @Nuria wrote: > It is worth looking at already existing event data, if we want to reuse the logic that reads events and persists... 
[13:30:02] elukey: we're not doing anything manually to cassandra though, aqs takes care of the schema change upon its restart [13:30:27] ah sorry then I misunderstood the change, good :) [13:30:53] elukey: yeah in this deploy we're just adding the two fields by declaring them in the json schema in aqs [13:31:04] when aqs restarts it automatically adds them to cassandra [13:31:13] Analytics, ORES, Scoring-platform-team: Purge ORES scores from Hadoop and begin backfill when model version changes - https://phabricator.wikimedia.org/T209742 (JAllemandou) Indeed in hadoop there is no such thing as 'in place'. The way to go could be to use model version as a partition-key. You'd ba... [13:31:15] ack [13:31:21] (CR) Fdans: [V: 2 C: 2] Update aqs to 05d72cf [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/474686 (owner: Fdans) [13:33:40] elukey, fdans - I think it's a bit trickier than that (reading the backlog) [13:34:03] elukey, fdans - Once aqs1004 gets deployed, the table gets updated in cassandra - everywhere [13:34:44] sure but I thought that we were adding fields that the current version of aqs wouldn't care about [13:34:46] Maybe (and hopefully), those changes being backward compatible, other aqs nodes still work, but there is a risk [13:35:16] joal elukey do i have to sudo to deploy? I'm getting a permission denied [13:35:34] nope you just need to scap deploy [13:35:48] maybe you are not in the group [13:35:48] I assume restbase-mod-cassandra will not fail when new fields are added, but I think I have never tested [13:36:29] yeah fdans you are not in deploy-aqs [13:37:05] joal: in testing, the only failures were related to me forgetting to bump up the version of the schema [13:37:27] joal: do you want to take over the deployment? [13:37:41] elukey: nope - just raising a point - [13:37:52] joal: nono read above, fran cannot deploy :D [13:38:02] Ah ! 
In case, yes :) [13:38:19] fdans: I'll open a task for you to be able to deploy next time, but since it needs sudo it'll take a bit [13:38:22] sorry [13:38:26] fdans, elukey - I was mentioning the point in case some failure happens, could be related [13:38:41] yep it was a valid and wise point [13:38:46] fdans: Shall I scap deploy aqs/deploy from deployment.eqiad.wmnet? [13:39:00] that would be awesome joal :) [13:39:12] * joal logs in deployment [13:39:39] * joal failure - Tries again to deployment [13:40:14] joal: sorry, I didn't finish my scap session, just finished it :) [13:41:20] !log Deploying aqs using scap [13:41:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:44:12] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1089.... [13:45:31] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1084.eqiad.wmnet'... 
[13:47:02] fdans: it works but there are some unexpected returned values I think [13:47:28] fdans: curl -X GET --header 'Accept: application/json; charset=utf-8' 'http://aqs1004.eqiad.wmnet:7232/analytics.wikimedia.org/v1/unique-devices/en.wikipedia/all-sites/daily/20180101/20180201' [13:47:40] fdans: vs curl -X GET --header 'Accept: application/json; charset=utf-8' 'http://aqs1005.eqiad.wmnet:7232/analytics.wikimedia.org/v1/unique-devices/en.wikipedia/all-sites/daily/20180101/20180201' [13:48:25] fdans: on aqs1004 you get: "offset":null,"underestimate":null values [13:49:02] joal: I see [13:51:31] aaaaaaaa goddamn :( the uniques function in aqs returns the same response object as received from cassandra [13:52:06] fdans: Do you want me to rollback, or shall I move forward and redeploy a patch later on? [13:52:20] sending new patch now joal, I'm sorry, forgive me [13:53:07] the change is backwards compatible so it shouldn't break anything... whatever you think is better joal [13:53:56] elukey: any opinion ? --^ [13:58:10] joal: I am going to depool aqs1004, you can fail the deployment and then we wait for Fran's patch ok? [13:58:30] elukey: +1 [13:58:49] elukey: I'm a bit afraid of the rollback on cassandra, but it's an interesting try :) [13:59:06] !log failing deployment on aqs to include a new patch [13:59:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:59:27] aqs1004 depooled [13:59:45] * joal is awakening to the fact that anything touching cassandra makes him afraid [14:11:10] ok - need to delete the fields from cassandra [14:11:53] there is a mismatch between the schema read from cassandra and the one loaded in conf (the 2 fields added), so initial check fails [14:11:54] fdans: are you working on the rollback? (just to confirm) [14:12:10] elukey: can we batcave for the prod patch please? 
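The canary check above compares the same unique-devices request on aqs1004 and aqs1005 and spots two extra keys on the canary. A minimal Python sketch of that comparison follows; it assumes each response body is JSON with an `items` list of objects, which is an assumption about the payload shape rather than something stated in the log. The helper name `extra_item_keys` and the sample payloads are made up for illustration and are not real AQS output.

```python
import json


def extra_item_keys(canary_body: str, baseline_body: str) -> set:
    """Return keys that appear in canary items but in no baseline item.

    Assumes both bodies are JSON objects with an "items" list (hypothetical shape).
    """
    canary_items = json.loads(canary_body).get("items", [])
    baseline_items = json.loads(baseline_body).get("items", [])
    # Union of all keys seen across the items of each response.
    canary_keys = set().union(*(item.keys() for item in canary_items))
    baseline_keys = set().union(*(item.keys() for item in baseline_items))
    return canary_keys - baseline_keys


# Made-up sample payloads standing in for the two curl responses.
canary_sample = json.dumps(
    {"items": [{"project": "en.wikipedia", "devices": 1,
                "offset": None, "underestimate": None}]}
)
baseline_sample = json.dumps(
    {"items": [{"project": "en.wikipedia", "devices": 1}]}
)

if __name__ == "__main__":
    print(sorted(extra_item_keys(canary_sample, baseline_sample)))
```

Diffing key sets rather than whole bodies keeps the check insensitive to values that may legitimately differ between hosts.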
[14:12:25] surez [14:12:38] elukey: I'm testing the new patch [14:13:10] elukey: other solution is to wait for the patch indeed [14:34:59] Analytics, Analytics-EventLogging: Resurrect eventlogging_EventError logging to in logstash - https://phabricator.wikimedia.org/T205437 (Ottomata) Sam, that'd be great! Find me and Marcel (mforns) on IRC in #wikimedia-analytics and lets discuss. Actually...@fgiunchedi or @herron might have something he... [14:51:48] elukey, fdans - Will drop off soon to get the kids [14:52:22] (PS1) Fdans: Remove underestimate and offset before sending uniques response [analytics/aqs] - https://gerrit.wikimedia.org/r/474701 (https://phabricator.wikimedia.org/T167539) [14:52:35] o/ [14:52:52] Hi milimetric [14:52:59] hey [14:53:09] ^ joal there's something wrong with my local aqs (it doesn't seem to be talking to cassandra) but I added a nice lil unit test to make sure the two fields aren't being passed [14:55:18] (PS2) Rafidaslam: Add "/health/summary/v1/" API endpoint [analytics/quarry/web] - https://gerrit.wikimedia.org/r/474532 (https://phabricator.wikimedia.org/T205151) [15:03:04] (PS3) Rafidaslam: Add "/health/summary/v1/" API endpoint [analytics/quarry/web] - https://gerrit.wikimedia.org/r/474532 (https://phabricator.wikimedia.org/T205151) [15:03:47] (CR) Joal: [C: 1] "LGTM" [analytics/aqs] - https://gerrit.wikimedia.org/r/474701 (https://phabricator.wikimedia.org/T167539) (owner: Fdans) [15:04:12] fdans, milimetric --^ Need to go for kids - I'll let you move with that, or I'll do it post standup [15:04:38] k [15:04:53] fdans: up to you, if you want to catch me up or wait till after standup [15:06:00] milimetric: batcave? 
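The patch discussed above removes `offset` and `underestimate` from the uniques response before it is sent, with a unit test guarding it. AQS is not written in Python, so the following is only an illustration of the idea and not the actual change; `strip_internal_fields` and the sample row are hypothetical.

```python
# Fields stored in the table but not meant to be exposed yet (from the discussion above).
INTERNAL_FIELDS = ("offset", "underestimate")


def strip_internal_fields(items, internal=INTERNAL_FIELDS):
    """Return copies of the rows with the internal fields removed.

    The input rows are left untouched so the caller can still use them.
    """
    return [
        {key: value for key, value in item.items() if key not in internal}
        for item in items
    ]


# Hypothetical row as it might come back from storage.
sample_rows = [
    {"project": "en.wikipedia", "devices": 1, "offset": None, "underestimate": None}
]

if __name__ == "__main__":
    print(strip_internal_fields(sample_rows))
```

Filtering with an explicit deny-list mirrors the quick fix described in the chat; an allow-list of public fields would be the stricter alternative if more storage-only columns are expected later.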
[15:06:43] fdans: sure [15:08:42] (CR) Milimetric: [C: 2] Remove underestimate and offset before sending uniques response [analytics/aqs] - https://gerrit.wikimedia.org/r/474701 (https://phabricator.wikimedia.org/T167539) (owner: Fdans) [15:08:45] (CR) Milimetric: [V: 2 C: 2] Remove underestimate and offset before sending uniques response [analytics/aqs] - https://gerrit.wikimedia.org/r/474701 (https://phabricator.wikimedia.org/T167539) (owner: Fdans) [15:13:32] (PS1) Fdans: Update aqs to 8909673 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/474705 [15:14:01] \o/ [15:14:55] (PS2) Rafidaslam: app.py: Handle meta endpoints when the specified id doesn't exist [analytics/quarry/web] - https://gerrit.wikimedia.org/r/474530 (https://phabricator.wikimedia.org/T209783) [15:16:12] (CR) Milimetric: [V: 2 C: 2] Update aqs to 8909673 [analytics/aqs/deploy] - https://gerrit.wikimedia.org/r/474705 (owner: Fdans) [15:18:24] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1089.eqiad.wmnet'... [15:18:33] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1089.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1089.... [15:38:39] RECOVERY - AQS root url on aqs1004 is OK: HTTP OK: HTTP/1.1 200 - 295 bytes in 0.004 second response time ? [15:38:52] fdans: have you guys done anything? 
[15:44:13] elukey: yesss we deployed all the things, making sure that the endpoint was ok in the canary [15:44:56] so ok for me to repool aqs1004 I guess [15:46:20] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1090.eqiad.wmnet'... [15:47:42] yes, thank you elukey [15:50:06] Analytics, Operations, ops-eqiad, Patch-For-Review, User-Elukey: rack/setup/install an-worker10[78-95].eqiad.wmnet - https://phabricator.wikimedia.org/T207192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1090.eqiad.wmnet'] ` and were **ALL** successful. [15:50:31] milimetric: do you have a few mins to brain bounce stream intake stuff before standup? [15:50:43] i think it's a quick one, i might just need a rubber duck [15:51:33] omw cave ottomata [16:01:33] ping fdans [16:06:29] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (Ottomata) Ping on this! I know it is TG week so things might be slow, but I'm checking in anyway :) [16:29:33] Analytics, Analytics-Kanban, EventBus, ORES, and 4 others: Modify revision-score schema so that model probabilities won't conflict - https://phabricator.wikimedia.org/T197000 (Ottomata) We plan to deploy this Monday Nov 26th. 
[17:03:37] ping joal milimetric [17:04:28] hola ping [17:04:34] hola ping joal [17:14:12] Analytics: Add new wikis to analytics - https://phabricator.wikimedia.org/T209822 (fdans) a:fdans [17:16:55] Analytics, ORES, Scoring-platform-team: Wire ORES scoring events into Hadoop - https://phabricator.wikimedia.org/T209732 (fdans) p:Triage>High [17:18:05] Analytics, Analytics-Kanban: Set up CI system on AQS - https://phabricator.wikimedia.org/T209711 (fdans) p:Triage>Normal [17:18:32] Analytics, Research, Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (fdans) p:Triage>Normal [17:18:48] Analytics, Research, Wikidata: Copy Wikidata dumps to HDFs - https://phabricator.wikimedia.org/T209655 (fdans) p:Normal>Low [17:19:16] Analytics, Dumps-Generation, ORES, Scoring-platform-team, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (fdans) p:Triage>Normal [17:19:24] Analytics, Dumps-Generation, ORES, Scoring-platform-team, and 3 others: [Epic] Make ORES scores available in Hadoop and as a dump - https://phabricator.wikimedia.org/T209611 (fdans) p:Normal>Triage [17:19:53] Analytics, Analytics-Kanban, Product-Analytics: Hive query fails with local join - https://phabricator.wikimedia.org/T209536 (fdans) p:Triage>High [17:21:28] Analytics, Analytics-Kanban: Update EventLogging kafkacat examples to use jumbo - https://phabricator.wikimedia.org/T209635 (fdans) p:High>Normal [17:28:08] Analytics, Analytics-Data-Quality, Product-Analytics: mediawiki_history datasets have null user_text for IP edits - https://phabricator.wikimedia.org/T206883 (fdans) We won't be making any changes in mediawiki history in the near term since we're redefining the way we sqoop data. 
[17:34:15] Analytics-Kanban, User-Elukey: Q1 2018/19 Analytics procurement - https://phabricator.wikimedia.org/T198694 (Cmjohnson) [17:37:44] Analytics, Analytics-Kanban: AQS unique devices api should report offset/underestimate separately - https://phabricator.wikimedia.org/T164201 (fdans) a:fdans [17:40:13] Analytics-Kanban, MediaWiki-General-or-Unknown, Tool-Pageviews: Check abnormal pageviews for some pages on itwiki - https://phabricator.wikimedia.org/T209404 (fdans) We see this pattern of traffic often and it usually comes from bots. We'll investigate this further as part of our bot ID efforts. [17:40:34] Analytics, Analytics-Kanban, MediaWiki-General-or-Unknown, Tool-Pageviews: Check abnormal pageviews for some pages on itwiki - https://phabricator.wikimedia.org/T209404 (fdans) [17:41:03] Analytics, MediaWiki-General-or-Unknown, Tool-Pageviews: Check abnormal pageviews for some pages on itwiki - https://phabricator.wikimedia.org/T209404 (fdans) [17:41:38] Analytics, MediaWiki-General-or-Unknown, Tool-Pageviews: Check abnormal pageviews for some pages on itwiki - https://phabricator.wikimedia.org/T209404 (fdans) p:Triage>Low [17:41:48] Analytics, MediaWiki-General-or-Unknown, Tool-Pageviews: Check abnormal pageviews for some pages on itwiki - https://phabricator.wikimedia.org/T209404 (fdans) p:Low>Normal [17:42:04] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (RobH) [17:42:41] Analytics, Operations, ops-eqiad, User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (RobH) [17:47:27] Analytics, Analytics-SWAP: heirloom-mailx fails trying to send out email from SWAP notebook - https://phabricator.wikimedia.org/T168103 (fdans) [17:48:46] Analytics, Analytics-Kanban, Patch-For-Review: Clickstream dataset for Persian 
Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (Ladsgroup) >>! In T191964#4736568, @Nuria wrote: > Tested job but failed, looking: https://yarn.wikimedia.org/cluster/app/application_154... [18:00:12] Analytics, Analytics-Kanban, Patch-For-Review: Clickstream dataset for Persian Wikipedia only includes external values - https://phabricator.wikimedia.org/T191964 (Nuria) @Ladsgroup i tested this and several other variations, none of which worked. You can git clone refinery depot on stats machines... [18:36:38] * elukey off! [20:35:14] (CR) Framawiki: Add "/health/summary/v1/" API endpoint (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/474532 (https://phabricator.wikimedia.org/T205151) (owner: Rafidaslam) [20:55:09] ottomata: btw, did you decide on the request->hasty thing from this morning? [20:56:22] ya milimetric i don't fully get the value of another map function, but I talked with petr and we decided to go ahead and add a context parameter. wanna BC and chat about it? [20:57:18] ottomata: this would replace all the other functions, so it'd be the only one, I can chat if you need but I'm happy with whatever you like best [20:57:47] let's chat, i think that wouldn't work easily, and is also similar to Petr's idea of an Event class that is extendable and would replace all the other functions [20:57:51] k [20:57:51] let's bc! [21:00:46] milimetric: added 2 comments to the doc - looks good :) [21:03:06] sweet, thx joal [21:03:30] I asked this earlier, but can anyone recommend a strategy for mapping Wikipedia editions to countries where the language is commonly spoken? 
[21:04:24] milimetric: other idea - making ultra clear the 2 projects we are driving - short term solution (not satisfactory), [21:04:28] and long term one [21:08:36] joal: good point, yeah, I'm still very much working it out [21:08:38] I did mean to point that out [21:15:27] Gone for tonight team - see you tomorrow [21:28:00] groceryheist: there are many mappings (as i am sure you know), like wikipedia in english is heavily used in the US but also in england [21:28:38] groceryheist: also dutch wikipedia being heavily used in places like Canada [21:29:37] groceryheist: so the "country in which this language is spoken the most" might not in all instances map to "the country with the most pageviews for project X". That works for some (japanese wikipedia comes to mind) but not all projects. [21:32:35] groceryheist: it is a field of study: https://blog.wikimedia.org/2017/10/27/new-interactive-visualization-wikipedia/ [21:36:26] nuria: Thanks for fixing the EventLogging MySQL DB on beta labs last week; it looks like it got stuck again though [21:37:13] RoanKattouw: argh, let me take a look, i think it comes down to labs hardware not being able to deal with the volume, a problem without a super easy solution [21:37:30] Yeah :| [21:38:23] I think there might be a task about this already, but it'd also be great to have Hadoop in labs. Right now Morten can't test pre-written data analysis queries there. I imagine that might run into more issues like this though, if keeping just a MySQL database is too much [21:38:47] Maybe the recent change where all schemas go into MySQL could have caused this? 
If so it's sort of my fault for asking for that :) [21:39:31] RoanKattouw: right, we have a hadoop lab cluster to test updates but labs could not deal with flow of data at all [21:39:52] RoanKattouw: on that do not worry, labs has had issues with inflow of events since day 1 [21:40:40] RoanKattouw: given that SAME code does not have any issues with 100X volume in prod i always think it is due to VMs but i know labs env so little that it might be something else [21:41:15] RoanKattouw: give me a sec i am in the middle of something but can look at this in 10 minutes [21:44:01] No worries, I'm about to grab lunch anyway [21:44:20] RoanKattouw: also, suspicious that "The last Puppet run was at Mon Nov 19 19:04:31 UTC 2018 (159 minutes ago)", right? [21:45:18] Hmm not sure. That's longer ago than I'd expect but not spectacularly so, and none of the tables have been updated since the 17th [21:53:36] RoanKattouw: i see updates on the 19th now [21:53:41] RoanKattouw: please take a look [22:05:58] Analytics: Alert and halt mediawiki processing on schema changes - https://phabricator.wikimedia.org/T209888 (Milimetric) [22:10:41] OK yes I see newer events now. Let me trigger a PrefUpdate event and see if it comes through [22:11:18] Never mind beta is totally down <_< [22:15:54] woah, like beta itself, not just EL in beta [22:16:21] problems with beta to me seem normal, and just underscore how insanely awesome our SRE team is to keep this kind of stuff from happening in prod [22:22:23] (PS2) Milimetric: [WIP] DO NOT MERGE testing speed of sqoop against labs with different approaches [analytics/refinery] - https://gerrit.wikimedia.org/r/473256 [22:28:30] Hey, I want to see logs of https://yarn.wikimedia.org/cluster/app/application_1542030691525_24459 but I can't find them. 
I even curl'd the logs link on stats machine but it was saying it's a redirect
[22:28:45] interesting I'm logged in as dr.who
[23:06:35] Amir1: logs in hadoop get aggregated once the job is over
[23:07:25] hmm, if you mean --output-base-path hdfs://analytics-hadoop/user/ladsgroup/ladsgroup-clickstream-test-2018-11-09 I couldn't find the file
[23:07:52] https://yarn.wikimedia.org/logs/
[23:07:55] Amir1: no, logs are not accessible through the UI
[23:08:14] Amir1: there are many logs in hadoop but the application logs are available like this:
[23:08:27] yarn log a--plicationID
[23:08:35] yarn logs -applicationId <application id>
[23:08:56] Amir1: once the job is over; while the job is running, logs are distributed across 50 machines
[23:09:02] oh thanks
[23:09:37] Amir1: all this is on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark
[23:10:42] Amir1: i will get back to that task next week but will follow your progress until then
[23:14:10] thanks
[23:14:26] I'm learning
[23:14:55] RoanKattouw: from my end looks like deployment-prep is all kaput
[23:21:25] Yeah it's been having some serious problems today
[23:21:45] (CR) Rafidaslam: Add "/health/summary/v1/" API endpoint (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/474532 (https://phabricator.wikimedia.org/T205151) (owner: Rafidaslam)
[23:22:42] RoanKattouw: ok, ping us tomorrow if you still have issues, i tailed logs for a bit and could see events coming in.
[23:23:42] Analytics, Analytics-Kanban, Product-Analytics, Reading-analysis: Final Vetting of Family Wide unique devices data - https://phabricator.wikimedia.org/T169550 (Nuria) @Tbayer: do you have some more comments related to vetting of this metric or is this the only one?
[23:25:08] Analytics-Kanban, Patch-For-Review: EventLogging Hive Refine broken after upgrade to CDH 5.15.0 - https://phabricator.wikimedia.org/T209407 (Nuria) Open>Resolved
[23:28:52] RoanKattouw: nuria WMCS people are doing neutron migration
[23:29:10] there is an email on wikitech-l
[23:31:16] superset down right now
[23:31:23] or not
[23:31:38] nevermind, was just hitching for a bit there
[23:33:23] being really slow right now though for some reason
[23:43:23] how does the data get collected?
[23:53:08] Amir1: does https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Search_through_logs help?
[23:54:33] I guess, Piwik...
[23:57:01] cthulchu: do you have a question?
[23:57:27] yeah, I asked it a bit before.