[02:03:04] Analytics-Kanban, Product-Analytics, Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (Milimetric) @Ijon I'm working on a blacklist, and wanted to check with you to see how it would impact the usefulness of the datase...
[02:27:59] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (Nuria) > It can be delayed for an arbitrary amount of time; events (with a timestamp and all their data) wi...
[04:18:37] PROBLEM - Check the last execution of monitor_refine_mediawiki_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:14:37] good morning camus!
[06:15:54] from the error logs it seems that the mapreduce job suffered a heap problem
[06:16:12] so I have manually added a more generous setting and restarted
[06:16:36] brb
[06:22:07] worked, and refine is now running
[06:22:10] let's see
[06:35:00] RECOVERY - Check the last execution of monitor_refine_mediawiki_events on an-coord1001 is OK: OK: Status of the systemd unit monitor_refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:35:55] goooood
[07:04:12] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (elukey) I am inclined to mark this as done given T234826
[08:11:42] awight: you broke my heart :(
[08:11:56] (kidding, thanks a ton for the el testing!)
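The "more generous setting" elukey mentions for the heap problem is not spelled out in the log. A common Hadoop rule of thumb is to give the JVM heap roughly 80% of the container's memory allotment; a hypothetical sketch of that sizing (the actual property names and values used on an-coord1001 are an assumption, not in the log):

```python
# Hypothetical helper illustrating a common MapReduce sizing rule of thumb:
# the -Xmx heap gets ~80% of the mapreduce.{map,reduce}.memory.mb container
# allotment, leaving headroom for non-heap JVM memory. Values are made up.
def heap_opts(container_mb: int, ratio: float = 0.8) -> str:
    """Return a JVM -Xmx flag sized as a fraction of the YARN container."""
    return f"-Xmx{int(container_mb * ratio)}m"

# e.g. a "more generous" 4 GiB container would pair with:
print(heap_opts(4096))  # -Xmx3276m
```

This is only a sketch of the heuristic, not the actual change applied that morning.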
[08:41:21] Analytics, Discovery, Event-Platform, Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (dcausse) @Ottomata absolutely this is for analysis purposes
[08:43:53] Analytics, Discovery, Event-Platform, Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (dcausse)
[08:48:23] Analytics, Discovery, Event-Platform, Wikidata, and 3 others: Log Wikidata Query Service queries to the event gate infrastructure - https://phabricator.wikimedia.org/T101013 (dcausse) @Ottomata I updated the task description to indicate the steps needed to make this happen, I probably missed some...
[09:34:28] hey, I have a quick question about wdqs_extract (don't worry, I don't want to bring it back)
[09:35:55] just in case you remember what it was doing: was it just a subset of the webrequest logs, or did it have some advanced transformations that I could reuse to extract the sparql queries our wdqs hosts receive today?
[09:47:42] no idea dcausse :(
[09:48:48] Analytics, Analytics-Kanban, User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (elukey)
[10:17:36] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Version analytics meta mysql database backup - https://phabricator.wikimedia.org/T231208 (elukey) @Ottomata do you think that it would be ok to remove the current HDFS backup (to remove one thing that blocks the krb deployment) given T2...
[10:49:02] * elukey lunch!
[11:51:40] Hi team - I'm here while the kids sleep and then will be back for standup
[11:52:16] joal: helloooo good sir
[12:21:24] elukey: which camus job failed?
[12:21:41] ...wait i will check email....
[13:23:55] dcausse: I'll check the git history and let you know
[13:24:59] milimetric: thanks! but don't waste too much time on this, I'm close to having something working in a SWAP notebook
[13:25:38] dcausse: 2 minutes :)
[13:25:48] :)
[13:25:50] dcausse: just a straight copy: https://github.com/wikimedia/analytics-refinery/commit/bdd566ada8a797b5670208987cd53994775f8f87#diff-fae5910d5b2d0fa42bd76cd2ddb95c20
[13:25:54] no transformations
[13:26:34] milimetric: great, thanks!
[13:27:53] just discovered the tags field on webrequest, and thankfully you add a 'sparql' tag somewhere in the refinery process; this is super handy for my use case
[14:29:44] hive2druid working now in hadoop test!
[14:56:39] cool!
[15:10:28] dcausse: stas added that tag for, i think, your same use case
[15:47:50] Analytics, Analytics-EventLogging, Analytics-Kanban: drop CitatitionUsage data on mysql - https://phabricator.wikimedia.org/T233893 (Nuria) Open→Resolved
[15:47:52] Analytics, Analytics-EventLogging, Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (Nuria)
[15:48:12] Analytics, Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), User-Elukey: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (Nuria) Open→Resolved
[15:48:27] Analytics, Analytics-EventLogging, Event-Platform, CPT Initiatives (Modern Event Platform (TEC2)), Services (watching): Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Nuria)
[15:48:29] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 3 others: CI Support for Schema Registry - https://phabricator.wikimedia.org/T206814 (Nuria) Open→Resolved
[15:48:40] Analytics, Analytics-Kanban: Add more dimensions to netflow's druid ingestion specs - https://phabricator.wikimedia.org/T229682 (Nuria) Open→Resolved
[15:49:00] Analytics, Analytics-Kanban, Services (watching): Mediarequests: Add endpoint for agreggated counts per file type per project - https://phabricator.wikimedia.org/T231589 (Nuria) Open→Resolved
[15:49:03] Analytics, Patch-For-Review, Services (watching): Add mediacounts data to AQS and, from there, Restbase - https://phabricator.wikimedia.org/T207208 (Nuria)
[15:49:18] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (Nuria)
[15:49:20] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Move refinery to hive 2 actions - https://phabricator.wikimedia.org/T227257 (Nuria) Open→Resolved
[15:49:32] Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (Nuria)
[15:49:34] Analytics, Analytics-Kanban: Verify what Python 2 packages deployed to Analytics hosts are needed - https://phabricator.wikimedia.org/T204737 (Nuria) Open→Resolved
[15:49:51] Analytics, Analytics-Kanban: Move Analytics Report Updater to Python 3 - https://phabricator.wikimedia.org/T204736 (Nuria) Open→Resolved
[15:50:10] Analytics, Analytics-Kanban, Event-Platform, Scoring-platform-team, Patch-For-Review: Change event.mediawiki_revision_score schema to use map types - https://phabricator.wikimedia.org/T225211 (Nuria) Open→Resolved
[15:50:30] Analytics, Analytics-Kanban, Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (Nuria) Open→Resolved
[15:50:32] Analytics-Kanban: Deprecate Python 2 software from the Analytics infrastructure - https://phabricator.wikimedia.org/T204734 (Nuria)
[15:51:24] Analytics, Analytics-Kanban, Event-Platform: Clean up descriptions of fields in included common schemas in mediawiki/event-schemas repository - https://phabricator.wikimedia.org/T233057 (Nuria) Open→Resolved
[15:51:26] Analytics, Analytics-EventLogging, Event-Platform, CPT Initiatives (Modern Event Platform (TEC2)), Services (watching): Modern Event Platform: Schema Registry: Implementation - https://phabricator.wikimedia.org/T206789 (Nuria)
[15:53:18] (CR) Nuria: Add spark job to generate a data quality report (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541557 (https://phabricator.wikimedia.org/T215863) (owner: Mforns)
[15:59:48] (CR) Nuria: [C: +2] "Let's merge this code if we have tested the job." [analytics/refinery] - https://gerrit.wikimedia.org/r/538613 (https://phabricator.wikimedia.org/T233504) (owner: Joal)
[16:10:56] (PS2) Fdans: (wip) Add backfill queries for mediarequest metrics [analytics/refinery] - https://gerrit.wikimedia.org/r/541817
[16:33:35] milimetric: o/
[16:33:37] sorry elukey, was running to another meeting
[16:33:39] elukey: I joined back, but you left just before :)
[16:33:44] you were gonna say about the aqs patch?
[16:33:46] aahaha sorry guys
[16:34:13] milimetric: yes, merged and puppet run on the aqs host; we can restart aqs on aqs1004 and test when you have time
[16:34:18] (the apply to all)
[16:34:33] depool 1004, restart aqs, test, repool, apply all
[16:34:42] usually this is what I do with Joseph
[16:35:00] yep, I can test now elukey
[16:35:13] ah ok, lemme depool 1004 then
[16:36:03] milimetric: 1004 depooled and ready
[16:37:20] Analytics, Desktop Improvements, Event-Platform, Readers-Web-Backlog (Kanbanana-2019-20-Q2): [SPIKE 8hrs] How will the changes to eventlogging affect desktop improvements - https://phabricator.wikimedia.org/T233824 (ovasileva)
[16:37:57] Analytics: dumps.wikimedia.org/other/mediawiki_history is missing some files - https://phabricator.wikimedia.org/T235112 (mforns)
[16:38:11] elukey: uh... weird... not getting data, hang on, gotta reboot my brain
[16:39:49] elukey: yeah, confirmed, aqs1004 is broken somehow, aqs1005 is fine
[16:40:17] proof: this returns no results: curl http://localhost:7232/analytics.wikimedia.org/v1/edits/aggregate/all-projects/all-editor-types/all-page-types/monthly/2017060100/2019100500
[16:40:24] like none at all, not even the older ones
[16:49:54] ouch
[16:54:12] milimetric: mmm does it work with curl -X GET --header 'Accept: application/json; charset=utf-8'?
[16:55:05] I just tunneled and curled Druid directly and mediawiki_history_reduced_2019_09 works fine from druid1004
[16:55:06] ah no, zero results for that
[16:55:25] but not fine from druid1005!
[16:55:30] which is ... weird
[16:56:02] what do you mean, not fine?
[16:56:05] elukey: steps to repro:
[16:56:11] https://www.irccloud.com/pastebin/FGQiX9pn/
[16:56:25] this works if I'm doing:
[16:56:25] ssh -N druid1004.eqiad.wmnet -L 8082:druid-public-broker.svc.eqiad.wmnet:8082
[16:56:30] anyway, with your curl, if you swap localhost with aqs1004 it works
[16:56:30] and it doesn't work if I'm doing:
[16:56:33] ssh -N druid1005.eqiad.wmnet -L 8082:druid-public-broker.svc.eqiad.wmnet:8082
[16:57:35] ok, 'cause it's rebalancing me (really silly how it does that), but the above is still weird, no?
[16:59:13] I am trying to repro now
[16:59:56] yeah, there's something still weird going on with the druid cluster, like it somehow hasn't distributed that datasource yet. Because this doesn't work:
[16:59:56] curl http://aqs1004.eqiad.wmnet:7232/analytics.wikimedia.org/v1/editors/aggregate/en.wikipedia.org/all-editor-types/all-page-types/all-activity-levels/monthly/2017090100/2019100900
[17:02:40] (but it does work with aqs1005, as you'd expect it to)
[17:02:42] the curl in the irccloud paste yields me an error for unclosed braces or similar
[17:06:37] ok, I can repro, druid1005 seems weird
[17:11:30] !log restart druid-broker on druid100[5-6] - not serving data correctly
[17:11:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:11:57] milimetric: now they work
[17:11:59] super weird
[17:12:51] indeed, the druids seem to have agreed to collaborate :)
[17:13:19] hammer ftw :)
[17:13:31] deploy away, best not to ask too many questions
[17:13:38] (said Luca ... never)
[17:13:39] I don't like this though
[17:13:52] it seems like a druid broker bug
[17:13:59] like they were stuck in some state
[17:14:01] sigh
[17:14:02] sorry, I'm in a meeting, yeah, I'd be happy to brainbounce after
[17:14:45] milimetric: ok to complete the aqs roll restart then?
[17:15:40] just repooled aqs1004, looks good
[17:15:48] elukey: I'm out of the meeting if you want to think about it more
[17:15:56] but it seemed like a hiccup where that datasource wasn't replicated
[17:16:03] agree that it's weird and that druid shouldn't do that
[17:17:52] milimetric: I think that the historicals all had the data correctly in place (otherwise we wouldn't have seen such a quick recovery) but the brokers were stuck in some weird state
[17:18:22] now I am wondering if they were returning empty responses to clients when hitting 1005/1006
[17:18:24] makes sense. Then yeah, roll deploy and we'll keep an eye on it
[17:18:38] I can always roll back with andrew if it keeps being weird
[17:19:07] it was just the _09 datasource elukey, the _08 one wasn't affected
[17:19:22] so the weirdness never surfaced to the public
[17:19:37] ah ok, better :)
[17:25:48] Analytics, MinervaNeue, Readers-Web-Backlog (Kanbanana-2019-20-Q2): MinervaClientError sends malformed events - https://phabricator.wikimedia.org/T234344 (Krinkle) @Jdrewniak If our statsv client is producing request urls with multiple query strings, that's a bug indeed. Feel free to use this task fo...
[17:27:15] milimetric: all done
[17:27:24] Analytics, MinervaNeue, Performance-Team (Radar), Readers-Web-Backlog (Kanbanana-2019-20-Q2): MinervaClientError sends malformed events - https://phabricator.wikimedia.org/T234344 (Krinkle)
[17:27:59] kk, will test
[17:28:39] looks great, thanks elukey
[17:30:33] I think that next Q we should upgrade Druid
[17:30:45] with the hope of a more stable thing
[17:31:32] will triple check later, going to dinner now!
[17:31:33] o/
[17:47:25] Analytics, Analytics-EventLogging, QuickSurveys, Readers-Web-Backlog (Kanbanana-2019-20-Q2): QuickSurveys EventLogging missing ~10% of interactions - https://phabricator.wikimedia.org/T220627 (Jdlrobson)
[18:02:52] milimetric, joal: is there any scooping that happens from the labs db that is not mw history? (we scoop geoeditors from prod so i cannot think of anything else)
[18:03:42] nuria: yeah, tables like pagelinks that folks have requested
[18:03:56] joal, I'm seeing that Hive.py does not support partition values that have dots (.) in them, is that needed for security reasons? or could it change?
[18:04:05] milimetric: and those we scoop monthly, correct?
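For reference, the AQS endpoints curled during the afternoon's Druid debugging follow a fixed REST path scheme. A small sketch that rebuilds the edits/aggregate URL from its segments (host and port 7232 are taken from the log; the helper name and defaults are my own):

```python
# Sketch: rebuild the AQS edits/aggregate URL used in the debugging above.
# Path segments mirror the public Wikimedia REST API layout; 7232 is the
# internal AQS port seen in the log. Helper name is hypothetical.
def aqs_edits_url(host, project, start, end,
                  editor_type="all-editor-types",
                  page_type="all-page-types",
                  granularity="monthly"):
    return (
        f"http://{host}:7232/analytics.wikimedia.org/v1/edits/aggregate/"
        f"{project}/{editor_type}/{page_type}/{granularity}/{start}/{end}"
    )

# Reproduces the curl target from the log:
print(aqs_edits_url("localhost", "all-projects", "2017060100", "2019100500"))
```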
[18:04:07] but I don't think we process and publish anything else that we sqoop
[18:04:21] yes, all sqoops are monthly
[18:04:28] k
[18:27:12] Analytics, Cloud-Services, Developer-Advocacy (Oct-Dec 2019): Develop a tool or integrate feature in existing one to visualize WMCS edits data - https://phabricator.wikimedia.org/T226663 (Milimetric) >>! In T226663#5542873, @bd808 wrote: > @Milimetric I don't quite understand what happened here, so...
[18:31:42] (CR) Milimetric: [V: +2] Add network-origin to the geoeditors-daily table [analytics/refinery] - https://gerrit.wikimedia.org/r/538613 (https://phabricator.wikimedia.org/T233504) (owner: Joal)
[18:31:52] mforns_: want to look at jupyter if it is not too late?
[18:32:27] nuria, sure!
[18:32:31] batcave?
[18:33:03] mforns_: yessir
[18:40:21] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019), Patch-For-Review: add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Milimetric) The column has been added and I'm restarting the job so it will be f...
[18:42:44] mforns_: anything left to deploy on https://phabricator.wikimedia.org/T223414? should I just move it to Done?
[18:44:56] a-team: gonna deploy refinery source and refinery now, don't see much on the deployment etherpad so ping me if you need to add something
[18:45:53] k!
[18:46:03] milimetric: what I merged doesn't need any actionables right now, deploying is enough
[18:46:13] k
[18:50:45] (PS1) Milimetric: Update changelog for 0.0.102 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541892
[18:51:00] (CR) Milimetric: [V: +2 C: +2] Update changelog for 0.0.102 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541892 (owner: Milimetric)
[18:56:58] oh ebernhardson, is that supposed to be a queue inside a queue?
[18:57:01] that's not a thing, is it?
[18:59:35] anyway, i 'fixed' it
[18:59:38] fifo
[18:59:39] oops
[18:59:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/541895
[18:59:57] i think yarn resourcemanager needs to be restarted for that to take effect
[19:00:04] i can do it if/when you need it, or we can wait until the next time it happens
[19:00:09] for maintenance reasons
[19:10:58] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019), Patch-For-Review: add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Nuria) I do not think that is needed as #cloud-services team has that data from...
[19:41:29] ottomata: scap deploy failed on stat1007 canary:
[19:41:33] https://www.irccloud.com/pastebin/c8bkABsR/
[19:41:48] yeahhhhh
[19:42:02] hm, oh, this isn't merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/541775
[19:42:06] so not the problem
[19:42:39] you know milimetric, that change made me realize: the git fat sha symlink to the jar in archiva is not immediate.
[19:42:48] it is created by a cron that runs every 5 minutes
[19:42:52] Analytics-Kanban, Product-Analytics, Patch-For-Review: Make aggregate data on editors per country per wiki publicly available - https://phabricator.wikimedia.org/T131280 (Ijon) Fine by me.
[19:43:08] maybe if you attempt to deploy too soon after the refinery-source release happens
[19:43:18] it'll fail.
[19:43:18] hm
[19:43:29] ottomata: are you telling me to wait 5 minutes and try again? :P
[19:43:38] possibly!
[19:43:51] ok, then this is what I will do
[19:45:18] (I was just kidding, what you say makes sense, but I want to try unplugging and plugging something in after this)
[19:47:18] ya milimetric, i think i should fix up that git-fat link script
[19:47:27] it is too heavy with so much stuff in archiva
[19:47:43] it computes shasums of every artifact in archiva every 5 minutes
[19:47:52] so, i can see the process running now
[19:47:56] started 3 mins ago, not done yet.
[19:48:03] canary worked this time but it failed on all the other targets, it looks like
[19:48:39] it worked? seems surprising, i don't see that sha yet
[19:54:33] it just finished failing again, I'm trying everything again, but yeah, the second time the canary worked really fast... which seems weird
[19:55:16] ottomata: really weird, this time it just flew through the scap deploy and finished
[19:55:38] !log refinery ... probably? deployed with errors like "No such file or directory (2)\nrsync error"
[19:55:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:56:45] uhhh hm
[19:56:50] ottomata: I did something wrong... the refinery commit says vv0.0.102
[19:56:56] but I removed that when I ran the build!
[19:57:03] (the leading v)
[19:57:23] ya milimetric, something is wrong i think
[19:57:35] artifacts/org/wikimedia/analytics/refinery/refinery-cassandra-v0.0.102.jar
[19:57:47] all the 102 jars are v0.0.102
[19:57:50] and their shas are wrong
[19:57:51] da39a3ee5e6b4b0d3255bfef95601890afd80709
[19:57:55] each has the same
[19:58:25] arghhh
[19:59:15] hm, is wikitech down for you?
[19:59:51] ah, it's back.
[20:00:58] milimetric: assuming the previous version value was bad, you might be able to just run https://integration.wikimedia.org/ci/job/analytics-refinery-update-jars/build again
[20:01:02] with the 'v'-less version
[20:01:41] I swear I didn't type a v in that stupid box, just like last time this happened
[20:02:29] ok, I'll run that and run scap again after... I guess
[20:02:33] ya
[20:06:03] PROBLEM - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:06:21] PROBLEM - Check the last execution of eventlogging_to_druid_netflow_hourly on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:08:49] PROBLEM - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:13:58] ottomata: same thing this time as far as scap is concerned - it takes a while and fails, then if you rerun it, it finishes right away
[20:14:12] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (Ottomata) > We should probably not ask mobile clients to download a bunch of kilobytes that they will proba...
[20:14:30] milimetric: that makes sense i think; git fat pull isn't happening the second time around
[20:14:54] the git pull of the refinery repo works; git fat shas are updated, git fat fails.
[20:14:55] yeah, but it doesn't matter how long I wait before running the first time, it still fails
[20:15:01] next scap, refinery git doesn't change versions
[20:15:03] so no git fat is run
[20:15:05] right
[20:15:30] ok, the shas look better
[20:15:32] let me check in archiva
[20:15:52] milimetric: what is the git fat failure?
[20:16:10] same as what I pasted above, the no such file or directory
[20:16:28] ya, but which sha?
[20:16:45] (that will tell me which file)
[20:16:55] i checked a few of the ones you just added
[20:16:59] they exist in archiva;
[20:17:51] hm, now refinery-deploy-to-hdfs fails
[20:18:25] oh sorry, I closed that terminal
[20:19:03] https://www.irccloud.com/pastebin/dPYihif2/
[20:20:15] milimetric: makes sense
[20:20:16] git fat failed
[20:20:23] but we need to know which files failed.
[20:20:28] i will try git fat pull on the deploy host
[20:21:03] /git-fat/da39a3ee5e6b4b0d3255bfef95601890afd80709
[20:21:04] oh
[20:21:04] milimetric:
[20:21:17] this is because the vXXX files that were around from before are still there
[20:21:20] we need to remove them manually
[20:21:23] will do
[20:21:52] thanks, appreciate it
[20:23:14] (PS1) Ottomata: Removing bad jars accidentally added by jenkins during release [analytics/refinery] - https://gerrit.wikimedia.org/r/541909
[20:23:31] (CR) Ottomata: [V: +2 C: +2] Removing bad jars accidentally added by jenkins during release [analytics/refinery] - https://gerrit.wikimedia.org/r/541909 (owner: Ottomata)
[20:23:34] PROBLEM - Check the last execution of refine_mediawiki_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:23:40] will look ^
[20:24:30] ?
[20:24:43] jobs are failing because of missing jars?
[20:24:50] but....the old versions?
[20:25:27] dunno what's up there, expect those to be fixed after this scap deploy
[20:27:56] ottomata: k
[20:27:59] maybe something's referencing current or something?
[20:28:08] maybe? but not all of those.
[20:28:20] milimetric: my scap deploy is hanging at
[20:28:20] analytics/refinery: fetch stage(s): 14% (ok: 1; fail: 0; left: 6)
[20:28:31] canary was successful
[20:28:38] oh
[20:28:40] it just moved
[20:28:41] ok
[20:28:44] probably taking a while to pull jars?
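The sha that keeps coming up, da39a3ee5e6b4b0d3255bfef95601890afd80709, is itself diagnostic: it is the SHA-1 digest of zero bytes, meaning git-fat hashed empty files for every 0.0.102 jar. A quick check:

```python
import hashlib

# The sha recorded for every bad 0.0.102 jar is the SHA-1 of the empty
# byte string, i.e. the artifacts git-fat hashed were zero bytes long.
empty_sha1 = hashlib.sha1(b"").hexdigest()
print(empty_sha1)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```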
[20:30:21] Analytics, Analytics-EventLogging, Better Use Of Data, Event-Platform, and 4 others: Modern Event Platform: Stream Configuration: Implementation - https://phabricator.wikimedia.org/T233634 (Ottomata) Ok, https://github.com/ottomata/mediawiki-extensions-ConfigExports now supports filtering on conf...
[20:31:37] yeah, when it works it does take a really long time
[20:32:55] Analytics, Analytics-Kanban, Event-Platform, ORES, and 5 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180 (Ottomata) FYI, we fixed the hairy problems by supporting map type fields, which the revisions-score stream uses! mediawik...
[20:33:00] PROBLEM - Check the last execution of refine_eventlogging_analytics on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:33:00] PROBLEM - Check the last execution of refine_mediawiki_job_events on an-coord1001 is CRITICAL: CRITICAL: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:33:44] Analytics, Event-Platform, Core Platform Team Legacy (Watching / External), Services (watching): log-events topic emitted in EventBus - https://phabricator.wikimedia.org/T155804 (Ottomata) Open→Declined Don't think this will go anywhere, declining! Feel free to reopen.
[20:34:47] 1 left...
[20:35:44] Analytics, Event-Platform, CPT Initiatives (Modern Event Platform (TEC2)), Patch-For-Review, Services (next): mediawiki/recentchange event should not use fields with polymorphic types - https://phabricator.wikimedia.org/T216567 (Ottomata) Open→Declined
[20:35:49] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, and 3 others: CI Support for Schema Registry - https://phabricator.wikimedia.org/T206814 (Ottomata)
[20:36:38] Analytics, Event-Platform, Core Platform Team Legacy (Watching / External), Services (watching): Failure in EventBus schema for mediawiki/revision/visibility-change - https://phabricator.wikimedia.org/T187362 (Ottomata) Open→Stalled
[20:41:46] milimetric: still doing the scap deploy?
[20:42:13] nuria: Andrew was fixing it, a few minutes ago he said one host was left
[20:43:38] ottomata: thanks for updating yarn, i'll try it out today
[20:44:08] ebernhardson: i haven't restarted resourcemanager
[20:44:19] nuria: yeah, it is still doing one host
[20:44:21] not sure which one
[20:44:32] milimetric: you should be able to run deploy to hdfs though
[20:44:34] go ahead and do that
[20:44:38] from stat1007 i guess?
[20:44:47] ottomata: pushing?
[20:44:48] ok
[20:44:58] nuria: i think it is git fat pulling on some target node
[20:45:03] analytics/refinery: fetch stage(s): 85% (ok: 6; fail: 0; left: 1)
[20:45:04] yeah, it's working now
[20:45:43] ottomata: hmm, yarn.wikimedia.org reports the queue, not sure
[20:46:17] oh
[20:46:20] then great
[20:46:21] no restart needed
[20:46:23] go ahead
[20:46:26] ok :)
[20:52:54] !log deploy of refinery and refinery-source 0.0.102 finally seems to have finished
[20:52:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:53:10] ottomata: if that other host is done, then nuria you can go ahead and bump up the version and restart
[20:53:36] nuria: I wanted to do that but I'm in a meeting now, I can do it in a couple of hours
[20:54:13] milimetric: let's see that the alarms go to ok before doing anything, i can bump the jar and backfill after that
[20:54:49] ok, I'll do the geoeditors_daily job
[20:54:58] (also when alarms are ok)
[20:57:56] milimetric: right, i do not understand why the alarms triggered (cc ottomata) for jobs like refine that have the older jars there
[20:58:04] milimetric: do you understand that?
[20:58:11] not at all
[20:58:29] there's a whole layer of magic around how git fat works, and all I know is that sometimes it doesn't and everything explodes
[20:59:22] nuria: i don't know why, i think perhaps that when the git fat pull failed, the old git-fat jars must have been left in a bad state?
[20:59:25] but i am not totally sure.
[21:00:21] ya, on an-coord right now even the old jars don't have real content
[21:00:29] ottomata: whatata?
[21:00:59] i think it is in the middle of the scap deploy
[21:01:04] that's the remaining host
[21:01:09] dunno why it's taking so long...
[21:01:11] ottomata: i do not understand, git fat should pull with a sha just the new jars, right?
[21:01:56] not 100% sure with scap, because scap does some symlink swapping when it deploys new versions
[21:02:04] so it might have to re-pull each git fat jar every time
[21:02:05] not sure.
[21:02:18] going to ctrl-c the scap deploy and try an-coord again
[21:03:03] ok, looks good now
[21:03:09] dunno why it didn't finish
[21:03:16] but i see real jar files now
[21:03:23] so, expecting recovery for the next set of scheduled refines
[21:09:27] ottomata: k
[21:25:57] milimetric: the changelog is missing the addition of the new column to geoeditors daily; we cannot change it now as it is pushed, but FYI
[21:26:02] RECOVERY - Check the last execution of refine_mediawiki_job_events on an-coord1001 is OK: OK: Status of the systemd unit refine_mediawiki_job_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:26:42] nuria: the column was just added to refinery, not refinery-source
[21:27:00] RECOVERY - Check the last execution of refine_mediawiki_events on an-coord1001 is OK: OK: Status of the systemd unit refine_mediawiki_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:35:29] milimetric: ah yes, that's right
[21:36:40] RECOVERY - Check the last execution of refine_eventlogging_analytics on an-coord1001 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[21:37:08] gotta run, byeyaa
[21:41:27] milimetric: ahem, given that nobody is around to merge the jar bump for refine: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/541929/ I will backfill tomorrow
[21:41:54] ok nuria, sounds good
[21:56:20] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Jul-Sep 2019), Patch-For-Review: add whether an edit happened on cloud VPS to geoeditors-daily dataset - https://phabricator.wikimedia.org/T233504 (Milimetric) ok, I restarted the monthly job and this column will be populated go...
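One way to tell "real jar files" apart from unpulled placeholders: git-fat replaces a managed file with a short text stub that starts with the magic string `#$# git-fat ` followed by the SHA-1 and size, so inspecting a file's first bytes distinguishes the two. A sketch (the stub magic is git-fat's format; the sample contents and helper name are made up):

```python
# Sketch: distinguish a real artifact from an unpulled git-fat stub.
# git-fat stubs are short text placeholders beginning with this magic,
# followed by the object's SHA-1 and size.
GIT_FAT_MAGIC = b"#$# git-fat "

def is_git_fat_stub(first_bytes: bytes) -> bool:
    """Return True if the given leading bytes look like a git-fat stub."""
    return first_bytes.startswith(GIT_FAT_MAGIC)

# Made-up sample contents for illustration:
stub = b"#$# git-fat 0123456789abcdef0123456789abcdef01234567 12345\n"
real_jar = b"PK\x03\x04"  # jars are zip archives, so they start with "PK"
print(is_git_fat_stub(stub), is_git_fat_stub(real_jar))  # True False
```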
[22:04:32] RECOVERY - Check the last execution of eventlogging_to_druid_navigationtiming_hourly on an-coord1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_navigationtiming_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:05:40] (PS1) MNeisler: Add the MobileWebUIActionsTracking schema to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/541946 (https://phabricator.wikimedia.org/T234563)
[22:06:00] RECOVERY - Check the last execution of eventlogging_to_druid_netflow_hourly on an-coord1001 is OK: OK: Status of the systemd unit eventlogging_to_druid_netflow_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:11:54] RECOVERY - Check the last execution of refine_sanitize_eventlogging_analytics_immediate on an-coord1001 is OK: OK: Status of the systemd unit refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:26:37] (CR) Nuria: [C: -1] Add the MobileWebUIActionsTracking schema to EventLogging whitelist (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/541946 (https://phabricator.wikimedia.org/T234563) (owner: MNeisler)
[23:11:23] Analytics, Analytics-Kanban: Enable geoeditors_daily deletion - https://phabricator.wikimedia.org/T234238 (Nuria)
[23:12:28] Analytics: dumps.wikimedia.org/other/mediawiki_history is missing some files - https://phabricator.wikimedia.org/T235112 (Nuria) a: Ottomata
[23:12:42] Analytics: dumps.wikimedia.org/other/mediawiki_history is missing some files - https://phabricator.wikimedia.org/T235112 (Nuria) I think @Ottomata did this sync by hand
[23:15:52] Analytics, Research: Taxonomy of new user reading patterns - https://phabricator.wikimedia.org/T234188 (Nuria) @MGerlach Logged-in users have a different pattern through the site; it will be worth checking that all their requests are served via varnish, if they are not (which might be the case) you have an...
[23:19:11] Analytics, Research: Taxonomy of new user reading patterns - https://phabricator.wikimedia.org/T234188 (Nuria) Ok, so i confirmed the data will indeed be there for all logged-in users, just pageview times will be longer on repeated pageviews cause those pages are not cached
[23:23:13] Analytics, Analytics-Kanban: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 (Nuria) Pinging @JAllemandou in case he has other ideas
[23:24:58] Analytics-EventLogging, Analytics-Kanban: Sunset MySQL data store for eventlogging - https://phabricator.wikimedia.org/T159170 (Nuria)
[23:25:43] Analytics, Analytics-EventLogging, Analytics-Kanban: Drop page create event data on mysql - https://phabricator.wikimedia.org/T233892 (Nuria) Open→Resolved
[23:25:45] Analytics, Analytics-EventLogging, Analytics-Kanban: Archive data on eventlogging MySQL to analytics replica before decomisioning - https://phabricator.wikimedia.org/T231858 (Nuria)
[23:27:45] Analytics, Analytics-Kanban, Patch-For-Review: Coarse alarm on data quality for refined data based on entrophy calculations - https://phabricator.wikimedia.org/T215863 (Nuria) p: Normal→High
[23:28:35] Analytics, Research: Recommend the best format to release public data lake as a dump - https://phabricator.wikimedia.org/T224459 (Nuria) Open→Resolved
[23:28:38] Analytics, Analytics-Kanban, Research-Backlog, Patch-For-Review: Release edit data lake data as a public json dump /mysql dump, other? - https://phabricator.wikimedia.org/T208612 (Nuria)
[23:29:43] Analytics, Operations, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) Ping @bblack to give us some priorities around this work
[23:32:15] Analytics, Analytics-Kanban, Cloud-Services, Developer-Advocacy (Oct-Dec 2019): Explore importing geoeditors_daily data (aggregated edits per namespace per country per wiki) into druid - https://phabricator.wikimedia.org/T234281 (Nuria) Open→Declined
[23:32:18] Analytics, Cloud-Services, Developer-Advocacy (Oct-Dec 2019): Develop a tool or integrate feature in existing one to visualize WMCS edits data - https://phabricator.wikimedia.org/T226663 (Nuria)
[23:34:25] Analytics: Use virtual image views to filter mediacounts - https://phabricator.wikimedia.org/T211030 (Nuria) Pinging @Tgr, but i think the logic to report mediacounts to the varnish endpoint was dismantled at some point.
[23:36:21] Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (Nuria) Declining, we decided a while back these endpoints are for tagged "user" traffic
[23:36:30] Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (Nuria) Open→Resolved
[23:36:39] Analytics: add agent-type dimension to pageviews per country endpoint - https://phabricator.wikimedia.org/T233238 (Nuria) Resolved→Declined
[23:39:28] Analytics, Operations, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria) a: JAllemandou
[23:39:51] Analytics, Analytics-Kanban, Operations, Traffic, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (Nuria)