[00:25:27] PROBLEM - Check the last execution of reportupdater-interlanguage on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:31:47] PROBLEM - Check the last execution of refinery-import-page-history-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:33:55] PROBLEM - Check the last execution of reportupdater-browser on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:35:41] RECOVERY - Check the last execution of reportupdater-interlanguage on stat1007 is OK: OK: Status of the systemd unit reportupdater-interlanguage
[00:41:59] RECOVERY - Check the last execution of refinery-import-page-history-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps
[00:44:11] RECOVERY - Check the last execution of reportupdater-browser on stat1007 is OK: OK: Status of the systemd unit reportupdater-browser
[02:16:46] ah, thanks nuria, didn't know about this section, I had updated the one it was linking to
[03:56:05] Analytics, EventBus, WMF-JobQueue, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached" - https://phabricator.wikimedia.org/T204183 (Krinkle)
[04:20:38] Analytics, EventBus, WMF-JobQueue, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: EventBus error "Unable to deliver all events: (curl error: 28) Timeout was reached" - https://phabricator.wikimedia.org/T204183 (Krinkle) Still seen currently, but will r...
[06:10:23] Analytics, Analytics-Kanban, DBA, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[06:13:14] Analytics, Analytics-Kanban, Operations, Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) Another crash happened last night ` Thread pointer: 0x0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you...
[06:34:32] Analytics, Analytics-Kanban, Operations, Product-Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) All the replication threads but x1 started fine. I have fixed all the x1 rows that failed and it has now caught up
[06:35:20] Analytics, Analytics-Kanban, DBA, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[08:03:32] morning!
[08:03:39] all refine spark jobs moved to timers
[08:08:40] the more I check, the more crons I find
[08:08:54] it seems a never-ending bucket
[08:08:54] :D
[08:10:44] elukey: it's me adding them at night after you're gone ;)
[08:10:50] o/ elukey
[08:12:26] ahhhh this is why! Bonjour! :)
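(A note on the Icinga alerts above: "Connection refused" on port 5666 means the monitoring server could not reach the NRPE agent on stat1007 at all; it says nothing about the systemd units themselves. A minimal sketch of re-running such a check by hand with the standard check_nrpe plugin; the remote command name here is an assumption:)

    # Probe the NRPE agent on stat1007 (10.64.21.118) directly from the monitoring host.
    /usr/lib/nagios/plugins/check_nrpe -H 10.64.21.118 -p 5666 \
        -c check_systemd_unit_reportupdater-browser   # command name is an assumption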
[08:15:08] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Refactor analytics cronjobs to alarm on failure reliably - https://phabricator.wikimedia.org/T172532 (elukey)
[08:17:40] joal: I am now moving eventlogging_to_druid_job to timers as well
[08:17:45] since it is using the refine stuff
[08:17:57] +1
[08:18:24] (CR) Joal: Update mediawiki-history comment and actor joins (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: Joal)
[08:18:39] and then there is also profile::analytics::refinery::job::project_namespace_map
[08:18:53] Wow - forgot about this one
[08:18:57] it's an important one !
[08:21:11] Analytics, Analytics-Kanban, DBA, Patch-For-Review, User-Banyek: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5] - https://phabricator.wikimedia.org/T210478 (Marostegui)
[08:21:54] Analytics-Kanban, Patch-For-Review: Coordinate work on minor changes for Edit Data Quality - https://phabricator.wikimedia.org/T213603 (JAllemandou)
[08:34:59] joal: to double check - the puppet comments say it is a weekly job, but it seems monthly?
[08:35:43] elukey: we need it monthly - It has been running weekly for some time but might have changed
[08:35:49] elukey: monthly is good IMO
[08:37:15] super
[08:44:13] ok ready to move it to timers
[08:45:34] there is also refinery-drop-query-clicks
[08:45:37] from discovery
[08:45:45] interesting use case, since we have a mailto for them
[08:48:40] anyway, this is the last cron in the hdfs user's crontab on an-coord1001
[08:48:47] all the rest is timer-based
[08:48:50] \o/
[08:49:33] Today's persistence token is awarded to elukey for his unstoppable killing of cron jobs :)
[08:56:58] ah!
[08:57:07] I found on stat1006 an old discovery report updater job
[08:57:08] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/380778/
[08:57:19] apparently mforns and I did not clean it up properly!
[08:57:20] buuuuu
[08:57:44] elukey: cron jobs are sneaky beasts - they hide in every user's space !!
[08:58:01] the remaining crons are mostly rsyncs
[08:58:07] that in theory are ok
[08:58:11] what do you think?
[08:58:25] You know me - I don't think :-P
[08:58:30] yeah sure
[08:58:33] maybe the opposite :D
[08:58:42] you have thoughts also for the things that we miss :D
[08:59:05] !log clean up reportupdater_discovery-stats-interactive from stat1006 - old job not cleaned up
[08:59:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:00:11] elukey: I have no objection keeping crons for rsync - But I wonder why
[09:00:46] joal: we can move everything in theory, it seemed a bit overkill at first since we usually don't disable them etc..
[09:01:35] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Refactor analytics cronjobs to alarm on failure reliably - https://phabricator.wikimedia.org/T172532 (elukey)
[09:01:40] elukey: only reason I can think of is alarms in failure case
[09:02:07] elukey: but failure is not an option here, as it should fix itself at next run
[09:02:18] the main issue is that rsyncs sometimes break for stupid reasons, and we might end up getting garbage
[09:02:21] mmmm
[09:03:11] ok I'd say to be done for this round of changes, and then to revise rsyncs etc.. for next quarter
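(For context on the cron-to-timer migration discussed above, a migrated job can be checked on the host roughly like this; the unit name is taken from the alerts at the top of the log, and the exact names per host may differ:)

    # List timers with their last and next scheduled runs.
    systemctl list-timers
    # Status and recent logs of one migrated unit (name assumed).
    systemctl status reportupdater-interlanguage.service
    journalctl -u reportupdater-interlanguage.service --since today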
[09:03:42] \o/ elukey :)
[09:04:28] elukey: thanks for making goals when not even one third of the quarter has passed :) It makes me feel we might be able to make more of them ;)
[09:11:17] :D
[09:12:09] I'd really love to find time to make the gpu on stat1005 work
[09:19:42] elukey: stat1005 is currently free of users, right? let's start with buster on that, it'll really profit from a graphics stack which is two years more recent than stretch
[09:20:07] we're still sorting out puppet for buster, but it should not be too far away
[09:20:57] and various other pieces are already in place, still a bumpy ride for sure at this point, but might be easier in the mid term
[09:22:47] moritzm: +1, I didn't think about it but yes!
[09:23:06] the main issue could be that hadoop client packages need to be tested in there
[09:23:16] since the host will be used mostly by the research team
[09:23:23] so they'll need hadoop access for sure
[09:23:25] what's the tentative GPU processing framework to be used there?
[09:23:52] at the moment nobody is really looking at it due to lack of time (and more pressing goals as we know :)
[09:24:17] some info is collected in https://phabricator.wikimedia.org/T148843
[09:25:24] probably the main use case would be to make TensorFlow work with the GPU
[09:32:10] ok, without a specific solution targeted it's hard to estimate whether it works in stretch or not :-) but I think it's probably safe to say that the support will be better in buster in any case, we might have basic buster services running by end of week
[09:43:47] (PS2) Joal: Update delete/restore in mediawiki-history [analytics/refinery/source] - https://gerrit.wikimedia.org/r/485710 (https://phabricator.wikimedia.org/T213603)
[10:56:03] still wondering what's best for camus on the testing cluster
[11:04:04] we could even try to see what the effects on kafka are of running camus manually for a bit on the testing cluster
[11:04:25] maybe with a flag to reduce the number of mappers
[11:23:33] I'll ask andrew this afternoon :)
[11:26:23] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (elukey) This is done, the only thing missing is decide how camus should...
[11:26:34] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (elukey)
[11:27:27] Analytics, Analytics-Kanban: Run critical Analytics Hadoop jobs and make sure that they work with the new auth settings. - https://phabricator.wikimedia.org/T212259 (elukey)
[11:28:20] this one needs to be broken down into more pieces --^
[11:33:46] Analytics, User-Elukey: CDH Jessie dependencies not available on Stretch - https://phabricator.wikimedia.org/T214364 (elukey) p: Triage → Normal
[11:39:57] Analytics, Operations, User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (elukey) p: Triage → Normal
[11:40:15] all right lunch + errand, ttl! :)
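(Regarding the 11:04 idea of a flag to reduce the number of mappers: Camus takes the standard mapred.map.tasks property from its job properties file, so the knob would look roughly like this; the file path is an assumption:)

    # Fewer mappers is gentler on Kafka, at the price of Camus possibly lagging behind the topics.
    grep mapred.map.tasks /srv/deployment/analytics/refinery/camus/camus.webrequest.properties   # path assumed
    # e.g. mapred.map.tasks=2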
[13:30:40] Analytics: [Bug] Type mistmatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (phuedx)
[13:39:32] Analytics: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (phuedx)
[14:37:42] Analytics, Analytics-Wikistats: Wikistats New Feature - DB size - https://phabricator.wikimedia.org/T212763 (Milimetric) > - The XML file with current pages, including user and talk pages (uncompressed) > - The full history dumps (uncompressed) As Nuria said, this depends on how the XML is structure...
[14:37:45] Analytics, Operations, Research, Article-Recommendation, User-Marostegui: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov)
[14:38:53] Analytics, Operations, Research, Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov)
[14:39:47] hey teammmmm :]
[14:42:30] Analytics, Operations, Research, Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov) Thanks everyone for the discussion. I've added a summary to the task description. @Pchelolo @Marostegui @Dzahn @Nuria @Ottom...
[14:43:16] Analytics, Operations, Research, Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov)
[14:51:36] mforns: o/
[14:51:46] all the spark related jobs should now be kicked off by timers
[14:52:14] elukey: woohooo
[14:52:16] the refine jobs too?
[14:53:16] yep!
[14:53:20] everything
[14:53:24] even report updater
[14:53:31] awesoome
[14:53:46] nothing is screaming now but let's triple check that everything works :)
[14:53:53] still need to do the final puppet clean up
[15:04:39] elukey, cooool, I used systemctl today and it's great :]
[15:06:33] nice!
[15:09:30] Analytics, Analytics-EventLogging, EventBus, Security-Team, and 3 others: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (Ottomata) Awesome, thank you!
[15:12:00] Analytics: Deprecate Spark 1.6 in favor of Spark 2.x only - https://phabricator.wikimedia.org/T212134 (Ottomata) Sounds good ya I can do this! I'll send an email today and we can do it after All-Hands.
[15:13:10] \o/
[15:15:04] ottomata: I'd have a couple of questions for you related to the hadoop testing cluster, whenever you have time
[15:15:24] !log Restarted turnilo to clear deleted datasource
[15:15:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:17:54] elukey: gimme just a few more maybe 10 mins...
[15:17:58] to get through email
[15:18:18] ottomata: even tomorrow, no rush :)
[15:18:25] we can discuss it during ops sync
[15:20:17] elukey: today is my only work day this week! i'm flying to SF tomorrow
[15:20:19] for wiki lead
[15:20:28] actually i might be able to work a little in the morning before my flight
[15:20:54] ahhh!
[15:24:57] joal: shall I go through all the patches and merge refinery/refinery-source and then deploy?
[15:25:19] (not including the quality patch you sent yesterday, still trying to wrap my mind around that)
[15:29:10] Analytics, Analytics-EventLogging, Discovery, EventBus, Services (watching): Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 (Ottomata) These events (for now) will go to the jumbo-eqiad Kafka cluster....
[15:31:23] elukey: reading emails see your phab ticket. is the libssl1.0.2 package installed?
[15:32:11] it is yes
[15:32:33] elukey@analytics1039:~$ dpkg -l | grep libssl
[15:32:33] ii libssl1.0.0:amd64 1.0.2o-1~wmf1 amd64 Secure Sockets Layer toolkit - shared libraries
[15:32:36] ii libssl1.0.2:amd64 1.0.2q-1~deb9u1 amd64 Secure Sockets Layer toolkit - shared libraries
[15:32:39] ii libssl1.1:amd64 1.1.0j-1~deb9u1 amd64 Secure Sockets Layer toolkit - shared libraries
[15:32:52] hm
[15:33:03] so this should be the same as it is on analytics-tool1003, right?
[15:33:08] hm
[15:33:14] tools1001
[15:33:18] tool1001 sorry
[15:33:58] since we have the source of the package we might just try to build our own hue-common version
[15:34:03] with proper deps
[15:34:22] (don't remember if it is hue or hue-common that wants libssl1.0.0)
[15:34:37] maybe there's also a way to avoid libssl being loaded and I missed it
[15:34:40] oh right, tool1001
[15:34:51] I opened the task to avoid forgetting about it :)
[15:34:54] ah ok
[15:35:01] hm, but why doesn't it work as is?...
[15:35:13] i had a similar prob with python with superset stuff
[15:35:21] ah even for superset?
[15:35:23] had to get a newer python cryptography version
[15:35:28] but here we can't just change deps
[15:35:45] which machine is this running on in test cluster?
[15:35:52] analytics1039
[15:49:54] oh, elukey this error is also happening on analytics-tool1001
[15:49:58] but hue runs anyway...?
[15:50:45] is it? for some reason the install step was failing due to that, and it was fixed when I installed the jessie package
[15:50:48] like on tool1001
[15:50:53] the jessie package?
[15:50:58] libssl1.0.0
[15:51:07] oh right
[15:51:28] confused.
[15:51:41] i thought the problem happens when hue is trying to start
[15:51:55] did we manually install libssl1.0.0 on tool1001 when we installed?
[15:52:08] yes but during install, IIRC, the init script was trying to start the daemon and failing, and apt returned errors
[15:52:26] yes I think so
[15:52:28] ah
[15:52:29] hm
[15:52:48] i see, so you are just suggesting to fix this whole problem (rather than adding a 1.0.0 dummy package for stretch) by rebuilding a custom hue package?
[15:52:50] because the package that moritz created (the dummy dep) on the cdh component is not installed
[15:53:12] yes if possible it would be great, removing also the other weird mysql jessie dep
[15:53:16] hm
[15:53:26] I think it is in hue-common
[15:53:27] aren't these cdh packages 'jessie' packages anyway?
[15:54:01] they are but there is no point in my opinion to hardcode libssl1.0.0, we could for example try to set 1.0.2
[15:54:25] hm
[15:54:29] hm
[15:54:32] we are also kinda forcing them into a stretch system
[15:54:35] yes
[15:54:46] but it sounds like if we do it for hue...we'd want to do it for all?
[15:54:47] the other alternative is copying the jessie packages to the cdh component
[15:55:08] hm, that's not a bad idea
[15:55:11] well hue is a corner case, we don't have other similar use cases right?
[15:55:17] not sure, i guess not.
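(Two quick ways to see what actually ties hue to libssl1.0.0 on stretch, sketched here; the path into Hue's bundled Python environment is a guess:)

    # What the package declares:
    apt-cache depends hue-common | grep -i libssl
    # What the bundled python cryptography module actually links against (path is a guess):
    ldd /usr/lib/hue/build/env/lib/python2.7/site-packages/cryptography/hazmat/bindings/_openssl.so | grep libssl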
[15:55:25] it is not super great that we keep libssl1.0.0 in there
[15:55:31] (in the cdh component I mean)
[15:55:50] I'll spend some time tomorrow to figure out why hue wants libssl1.0.0
[15:55:59] hm, i guess i'm fine with either, dunno how hard it is to rebuild hue
[15:56:00] I am planning to do non-invasive work before SF :P
[15:56:06] ok
[15:56:13] it's probably an out-of-date python crypto lib
[15:56:23] so if you rebuild, you'll have to update the python requirements
[15:56:27] yes good bet, it would make sense
[15:56:33] to a newer one that requires a later libssl
[15:56:47] but then you also have to make sure the newer crypto is still compatible with hue
[15:56:48] :)
[15:56:52] or possibly remove the crypto deps
[15:57:00] hm, i think you'll need those
[15:57:03] or, maybe we don't
[15:57:07] since we use nginx for the ssl
[15:57:13] hm
[15:57:22] exactly, and we don't use tls between nginx/varnish and hue
[15:57:25] oh but ldaps maybe
[15:57:30] dunno, it might need them to talk to ldap
[15:57:44] ah right it does it by itself
[15:57:47] ufff
[15:57:57] let's upgrade to CDH 6!
[15:57:58] :P
[15:58:30] ah!
[15:58:30] https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_fixed_in_516.html#fixed_issues
[15:58:34] 5.16.1 is out
[15:59:07] let's upgrade to bigtop!
[15:59:21] let's upgrade to that other weirdo dist with better security and forget kerberos! :D
[15:59:21] YES --^ :)
[16:00:30] sure but after that I'll need to get medical attention in a psych ward
[16:00:41] :D
[16:00:42] haha
[16:01:19] a-team: standdduppppp
[16:03:14] Analytics, Analytics-Kanban, User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (Halfak) I assumed that stat1006/7 would have the same basic puppet config. I wonder why there is a difference. The problems are resolved by git lfs 2.6.1. I can't say whether or not...
[16:10:20] Analytics, Analytics-Kanban, User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (elukey) Thanks for the feedback! >>! In T214089#4899360, @Halfak wrote: > I assumed that stat1006/7 would have the same basic puppet config. I thi...
[16:32:13] Analytics, Analytics-Kanban, User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (Halfak) Great! Thank you for your help with this :)
[16:33:34] Analytics, Analytics-Kanban, User-Elukey: Update git lfs on stat1006/7 - https://phabricator.wikimedia.org/T214089 (Ottomata) > I assumed that stat1006/7 would have the same basic puppet config Just FYI: They don't. They share some common things, but they apply different puppet roles and have differ...
[16:35:02] (CR) Milimetric: [C: +2] Join to new actor and comment tables [analytics/refinery/source] - https://gerrit.wikimedia.org/r/476553 (https://phabricator.wikimedia.org/T210543) (owner: Milimetric)
[16:35:40] (CR) Milimetric: [C: +2] Update mediawiki-history comment and actor joins (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/480796 (https://phabricator.wikimedia.org/T210543) (owner: Joal)
[16:38:36] (CR) Milimetric: [V: +2 C: +2] Update sqoop selects for new mediawiki schema [analytics/refinery] - https://gerrit.wikimedia.org/r/476100 (https://phabricator.wikimedia.org/T210541) (owner: Milimetric)
[16:41:01] Does wmf analytics have any way of determining how many "active" editors there were on a given project at a given point in time / month?
[16:42:22] addshore: that metric is calculated "monthly"
[16:42:32] (CR) Milimetric: Update hive and oozie for labs/prod sqoop (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542) (owner: Joal)
[16:42:58] addshore: as in "active editors for february in jpwiki", you could also calculate it daily but it will be less meaningful
[16:43:57] addshore: "active editors" in new times where logins are shared is also easier to calculate than going back
[16:44:13] nuria: thanks! :)
[16:44:52] nuria: are there some docs on wikitech for that that I should be looking at?
[16:45:26] addshore: they will be on meta, active editors is not yet in wikistats, it will be though
[16:45:37] (PS7) Joal: Update hive and oozie for labs/prod sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542)
[16:46:12] addshore: i mean, it is there for 1 project: https://stats.wikimedia.org/v2/#/it.wikipedia.org/contributing/editors/normal|line|2-Year|editor_type~anonymous*group-bot*name-bot*user
[16:47:11] addshore: in the api (not UI) you can split by activity level/content and type of editor
[16:47:15] addshore: will taht work?
[16:47:17] *that
[16:47:45] addshore: see https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Editors
[16:50:32] (PS8) Joal: Update hive and oozie for labs/prod sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542)
[16:51:34] Analytics, Discovery: Workflow to be able to move data files computed in jobs from analytics cluster to production - https://phabricator.wikimedia.org/T213976 (Ottomata) FYI, I believe @halfak and the #ores folks have a use case for this too. They build some models on stat1007 and use git lfs to push th...
[16:52:06] biking home, back in a bit
[16:53:20] Analytics: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (Nuria) Types are inferred on incoming data, something to be aware of is that if a field defined as number comes with values 1,27,300 it will be stored as an integer (regardless of...
[16:57:43] Analytics, Operations, Research, Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (Nuria) @bmansurov I am guessing that option 2 is the most likely one, in any case I want to stress that we should really be working on...
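(The API nuria points addshore to above is the AQS editors endpoint on wikimedia.org; a sketch of a query split by editor type and activity level. The parameter values here are illustrative, check the REST API docs for the exact enums:)

    # Monthly count of registered editors with 5-24 edits on it.wikipedia.org during 2018.
    curl -s 'https://wikimedia.org/api/rest_v1/metrics/editors/aggregate/it.wikipedia.org/user/all-page-types/5..24-edits/monthly/20180101/20190101'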
[16:58:09] (PS9) Joal: Update hive and oozie for labs/prod sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542)
[16:59:11] (PS10) Joal: Update hive and oozie for labs/prod sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542)
[16:59:40] (CR) Milimetric: [V: +2 C: +2] Update hive and oozie for labs/prod sqoop [analytics/refinery] - https://gerrit.wikimedia.org/r/476855 (https://phabricator.wikimedia.org/T210542) (owner: Joal)
[17:00:58] (CR) Milimetric: [V: +2 C: +2] Update mediawiki_history oozie job datasets [analytics/refinery] - https://gerrit.wikimedia.org/r/483692 (https://phabricator.wikimedia.org/T213524) (owner: Joal)
[17:02:43] (PS2) Milimetric: Add nap.wikisource to whitelist.tsv [analytics/refinery] - https://gerrit.wikimedia.org/r/478942 (https://phabricator.wikimedia.org/T210752) (owner: Rafidaslam)
[17:03:13] (PS1) Joal: Bump changelog.md to v0.0.84 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/485853
[17:03:18] milimetric: --^
[17:04:20] Analytics, Operations, Research, Services, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (bmansurov) @Nuria, thanks for the input. I suppose you mean option 2 of the first point. > Has that work started? I'm currently wor...
[17:04:35] (CR) Milimetric: [V: +2 C: +2] "thanks!" [analytics/refinery] - https://gerrit.wikimedia.org/r/478942 (https://phabricator.wikimedia.org/T210752) (owner: Rafidaslam)
[17:05:11] (PS3) Milimetric: [SPIKE] [Don't merge] [analytics/refinery] - https://gerrit.wikimedia.org/r/466730
[17:17:15] (PS1) Milimetric: [SPIKE] need to analyze why these are so different. You can see the differences by running meld between the two files. The project names have been formatted to match, but there are still a few false negatives due to naming. [analytics/refinery] - https://gerrit.wikimedia.org/r/485857
[17:20:08] (CR) Milimetric: [C: +2] Bump changelog.md to v0.0.84 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/485853 (owner: Joal)
[17:22:20] (CR) Milimetric: [V: +2 C: +2] Bump changelog.md to v0.0.84 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/485853 (owner: Joal)
[17:34:55] joal: o/
[17:35:05] is the refinery-download-project-namespace-map something that I can execute now to test if it works?
[17:35:18] or does it need to be run only at the start of each month?
[17:35:23] (super ignorant about it)
[17:35:50] I'd like to make sure that it doesn't break since the next run will be during the all hands :)
[17:36:37] elukey: If you run it now, it'll overwrite the current month values - Not a big deal, but worth testing manually I assume
[17:37:11] joal: ah you mean not using /wmf/data/raw/mediawiki/project_namespace_map but say /user/elukey/test
[17:37:20] yessir
[17:37:25] ack makes sense
[17:37:31] Thanks :)
[17:49:25] joal:
[17:49:26] drwxr-xr-x - elukey elukey 0 2019-01-22 17:47 /user/elukey/test-namespace-map/snapshot=2018-12
[17:50:10] looks good afaics
[17:50:41] elukey: minimal difference in size for the files the folder contains - LGTM :)
[17:50:56] elukey: Thanks a lot for the check :)
[17:51:13] gooood
[17:51:37] there were a couple of nits to fix, so having it triple checked was good :)
[17:52:53] :) Gone for dinner - Back in a bit
[18:06:26] ottomata: is it a good time to chat? (even on IRC)
[18:07:42] oh elukey yes sorry
[18:07:46] thought it was just about the hue thing
[18:07:48] now is great
[18:07:52] can do irc or bc, whichever you prefer
[18:09:02] irc is fine!
[18:09:21] so the first question is if you think that a druid testing cluster on ganeti could be good
[18:09:34] basically I'd like to couple it with the testing cluster
[18:09:46] oh yes, don't see why not....
[18:09:48] and see if our loading/drop jobs work
[18:09:52] if just for testing, we could maybe even colocate it?
[18:09:55] instead of ganeti?
[18:09:58] either is fine with me
[18:10:09] co-locating on the hadoop nodes?
[18:10:17] oh wait but there are dep issues right?
[18:10:20] with e.g. the zookeeper package?
[18:10:28] that was the suggestion ya
[18:10:33] but maybe ganeti is easier
[18:10:39] we don't really need a full cluster even, do we?
[18:10:42] a single node would be ok?
[18:10:59] yep I think so
[18:11:07] this is a good suggestion
[18:11:27] since we don't really need any redundancy
[18:11:28] mmmm
[18:12:03] I can decom one worker node and apply the druid role on it, should be even quicker
[18:13:00] will check tomorrow, thanks :)
[18:13:11] second question is about Camus in the testing cluster
[18:13:25] elukey: that's a good idea (decom one worker)
[18:13:28] testing camus ya?
[18:13:32] oh you are worried about load on kafka ya?
[18:13:34] yeah
[18:14:10] I was thinking of dialing down the number of mappers but then we might end up with a lagging camus
[18:14:17] that is not the end of the world for testing
[18:14:31] any ideas?
[18:17:33] elukey: i'm looking to see if we can limit the partitions camus consumes from
[18:17:37] kafka partitions
[18:17:53] that was the other idea that I forgot!
[18:19:46] i don't see it....
[18:19:59] i see how we could add it to our camus build...
[18:20:40] so basically the idea would be to add a parameter that can take the partitions to read (default all)
[18:20:54] and use it for the camus in the testing cluster
[18:21:04] say to read from 5 partitions as a starting point
[18:21:17] or even 1
[18:22:46] yeah
[18:22:47] elukey: but then we'd have to write some java code for camus and rebuild
[18:22:48] but
[18:22:55] that's not hard; we already maintain a fork
[18:23:00] but mehhhh
[18:23:05] I can attempt a patch tomorrow
[18:23:18] not sure how else to do it though...
[18:23:25] just running camus as is would be a lot
[18:23:25] hmmm
[18:23:37] it seems reasonable to add a patch, I mean it shouldn't be that huge
[18:23:37] elukey: we could reproduce webrequest to a diff topic for testing
[18:23:40] and only use a single partition
[18:23:46] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (Pchelolo) Before we begin, some bike-shedding needs to be done. 1. Lan...
[18:23:47] and just change the topic
[18:24:19] ottomata: like a spark job that reads from only one partition and produces to webrequest_text_test or similar?
[18:25:34] Analytics, Performance-Team: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (Krinkle)
[18:26:57] sure
[18:27:03] spark job or anything
[18:27:08] could even be kafkatee
[18:27:13] or a kafkacat chain
[18:27:25] we just want this temp for testing/devel purposes, right elukey?
[18:28:39] yep yep [18:29:09] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (10Ottomata) Yes node is fine with me! Agree report on incompatibilities w... [18:30:24] elukey: FYI if you do want to try a camus patch [18:30:29] https://github.com/wikimedia/analytics-camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/mapred/EtlInputFormat.java#L322 is about where it would be done [18:30:59] ack [18:31:01] here its iterating over the topic partitions and creating the puill jobs [18:31:19] so we'd need some config to map topics to partitions to consume, if provided, just skip partitions not whitelisted [18:31:20] the spark/kafkatee solution could be an easy one though [18:31:26] elukey: yeah that is probably much easier [18:31:38] all right you answered to all my questions, thanks a lot! [18:31:40] you could even run it as a one off when you are testing [18:31:48] just consume | produce into a topic [18:31:49] run camus. [18:31:55] ack [18:31:57] no need to even run it realtime [18:32:26] I didn't deploy the camus profiles for the testing coordinator due to the crons/timers [18:32:32] didn't want to start anything [18:32:44] but I'll make a plan tomorrow [18:33:29] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (10Pchelolo) There's already `json-schema-compatibility` lib on npm, but th... [18:33:29] going off for dinner now, if I don't see you tomorrow have a nice trip! [18:34:01] laters! [18:34:36] 10Analytics, 10Performance-Team: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (10Krinkle) >>! From **[wikitech.wikimedia.org](https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines#Schema_set_up)**: > [..]... [18:34:58] 10Analytics, 10Analytics-EventLogging, 10EventBus, 10Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (10Ottomata) I think github too, but am not opinionated on this one. :) [18:41:14] 10Analytics, 10Performance-Team: [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (10Ottomata) Yeah, that is a big problem we ran into (and I didn't realize when I wrote that part of the wikitech page). Since the integers are always int... [18:47:17] 10Analytics, 10Operations, 10Research, 10Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (10Pchelolo) [18:49:51] 10Analytics, 10Operations, 10User-Elukey: notebook/stat server(s) running out of memory - https://phabricator.wikimedia.org/T212824 (10crusnov) Just as an extra data point, early morning 2019-01-22 nagios-nrpe-server crashed on stat1007 from a cannot allocate error. [18:50:55] 10Analytics, 10Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (10MusikAnimal) >>! In T210313#4895391, @Tgr wrote: > ... . The API works fine with images. Are we sure? This is for January 20's Today... 
[18:54:09] Analytics, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Tgr) You are right, I didn't look at the output, just that it gives an OK response. Image views are in the same file so probably a simple fix though? Maybe @Harej remembers if that...
[18:58:25] Hey milimetric - Back from dinner - How is it going?
[18:59:23] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (mobrovac) >>! In T206889#4899969, @Pchelolo wrote: > 1. Language. I prop...
[19:00:25] Just finishing lunch joal, everything's fine, sync to hdfs done, will keep going in a few minutes
[19:00:32] great :)
[19:00:41] oh yeah, log
[19:00:50] Good idea ;)
[19:00:56] Thanks !
[19:00:57] !log deployed refinery with refinery-source
[19:00:58] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:07:11] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (Ottomata) > I think, at least in the first iteration, it should not reso...
[19:08:44] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform: Stream Connectors - https://phabricator.wikimedia.org/T214430 (Ottomata) p: Triage → Normal
[19:09:24] Analytics-EventLogging, Analytics-Kanban, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 3 others: Modern Event Platform (TEC2) - https://phabricator.wikimedia.org/T185233 (Ottomata)
[19:21:49] (PS2) Krinkle: Add ServerTiming to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/476841 (https://phabricator.wikimedia.org/T207862) (owner: Gilles)
[19:22:22] (CR) Krinkle: [C: +1] "(was this meant to be merged? Note that the repo doesn't currently have Verified tests and thus no auto-merge on CR+2)" [analytics/refinery] - https://gerrit.wikimedia.org/r/476841 (https://phabricator.wikimedia.org/T207862) (owner: Gilles)
[19:35:00] Analytics-EventLogging, Analytics-Kanban, EventBus, Product-Analytics, and 4 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (Ottomata) I'd like to first tackle standardizing HTTP request information fields. E.g. `user_agent`, `ip`, `a...
[19:37:09] Analytics, Tool-Pageviews: Statistics for views of individual Wikimedia images - https://phabricator.wikimedia.org/T210313 (Harej) >>! In T210313#4900059, @Tgr wrote: > Maybe @Harej remembers if that was an intentional limitation or a bug. I wanted to expand mediaplaycounts-api to include static images...
[19:57:50] Analytics, Analytics-Kanban, Page-Issue-Warnings: event_pageissues Turnilo view contains no valid data from before January 5 - https://phabricator.wikimedia.org/T214136 (Tbayer) >>! In T214136#4896765, @mforns wrote: [...] >> Thanks! Could we include a slightly longer timespan? This is basically data...
[20:02:03] (CR) Nuria: [V: +2 C: +2] Add ServerTiming to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/476841 (https://phabricator.wikimedia.org/T207862) (owner: Gilles)
[20:13:48] Analytics-EventLogging, Analytics-Kanban, EventBus, Product-Analytics, and 4 others: Modern Event Platform: Schema Guidelines and Conventions - https://phabricator.wikimedia.org/T214093 (Ottomata) FYI, the patch for review for CirrusSearchRequestSet is here: https://gerrit.wikimedia.org/r/#/c/med...
[20:17:35] (PS1) Milimetric: Update jar versions [analytics/refinery] - https://gerrit.wikimedia.org/r/485888
[20:17:57] (CR) Milimetric: [V: +2 C: +2] Update jar versions [analytics/refinery] - https://gerrit.wikimedia.org/r/485888 (owner: Milimetric)
[20:19:17] milimetric: I thought our changes would not impact those jobs --^
[20:19:30] bd808: q
[20:19:35] joal: oh, I thought we just missed the jar update
[20:19:49] if the new api action schema just had the query params as a string as they are in the http req
[20:19:51] would that work?
[20:19:59] hive has a built-in function
[20:20:01] parse_url
[20:20:04] joal: do you think I need to figure out alter statements or should I just drop and recreate the tables that changed?
[20:20:16] joal: IMO our changes in refinery-source don't impact checker nor reduce
[20:20:36] No bother, but no need to deploy for the thing to work ;)
[20:20:41] milimetric: --^
[20:20:45] parse_url(concat('http://a.b.c/path', http_info.uri_query), 'QUERY', 'query_key1')
[20:20:47] would return
[20:20:50] the value of query_key1
[20:20:53] in some url
[20:20:58] ?query_key1=abc
[20:21:00] so abc
[20:21:08] joal: do you think I need to figure out alter statements or should I just drop and recreate the tables that changed?
[20:21:14] or i guess request params are also from POST?
[20:21:49] milimetric: alter statements should be better for metastore - A lot of partitions those tables have
[20:22:11] Analytics, Operations, Research, Article-Recommendation, and 2 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (mobrovac) I agree that the most likely solution to work here is option (2), i.e. getting a host to execute it from. Perhap...
[20:22:24] ok, cool. That metastore is very fragile, I should look into why it's slow to do basic things like partition management
[20:22:31] milimetric: also, my last tests on hive table for avro data on page for instance showed errors - can you confirm the same for you?:
[20:22:31] Analytics, Operations, Research, Article-Recommendation, and 3 others: Transferring data from Hadoop to production MySQL database - https://phabricator.wikimedia.org/T213566 (mobrovac)
[20:22:54] joal: huh? like the page table is broken?
[20:23:05] milimetric: it's not that fragile, it'll handle the repair correctly - But it'll take some time :)
[20:23:40] I can select from mediawiki_page
[20:23:58] milimetric: ok great
[20:30:30] !log updated hive tables in wmf_raw for actor/comment refactor
[20:30:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:37:32] milimetric: I have found a broken table - mediawiki_logging for snapshot after 2018-09
[20:38:16] AvroTypeException: Found string, expecting union
[20:38:22] hm...
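(As worked out just below, the breakage is log_user being a string in the Avro files while the Hive DDL still declares it a bigint; the alter milimetric applies amounts to something like this, exact statement assumed:)

    hive -e "USE wmf_raw; ALTER TABLE mediawiki_logging CHANGE COLUMN log_user log_user string;"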
[20:38:45] milimetric: I looked at the schemas using spark - log_user is expected to be a long in hive while it is a string in avro files [20:39:19] ok, will alter [20:39:35] milimetric: this is the thing I patched in the refinery-source - forgot to update the table :( [20:39:38] milimetric: my bad :( [20:40:10] hey I was there, I remember, but wait, weren't there a lot of columns like this? [20:41:49] joal: did the alter, selecting from it works for new snapshots, breaks for old. I guess that's ok [20:42:00] milimetric: I think it;s better this way [20:42:07] yep [20:42:16] milimetric: providing a patch for the core create [20:43:08] weird that in testing we got that issue with other tables like revision and archive, but here it seems to work. How random [20:45:26] milimetric: I found the patch - https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/478046 - Only log)user to correct [20:46:14] ah, joal ok, this is better, I thought I remembered looking at this [20:46:32] so when I did the alter I just didn't diff far enough back, because there was another deploy [20:46:40] this should've been altered during that previous deploy [20:46:49] (03PS1) 10Joal: Update mediawiki_logging table creation script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/485894 [20:46:52] milimetric: --^ [20:47:06] milimetric: thanks for the quick patch :) [20:49:01] joal: wait, doesn't this fix it actually? [20:49:02] https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L287 [20:49:30] in this change: https://github.com/wikimedia/analytics-refinery/commit/b683424a16c9f9d4f6ff82dbde7fc3b633f382d0 [20:49:52] so new data would come in as bigint [20:50:00] milimetric: maybe? hm [20:50:25] well, anyway, I'll merge your change and we can take another look after next sqoop, it's just metadata anyway [20:50:27] milimetric: then the patch for table change is only valid for now, and we're gonna need to roll-it back [20:50:53] it prevents requesting though - But yeah, we should be ok :) Let's not forget ;) [20:51:03] (03CR) 10Milimetric: [V: 03+2 C: 03+2] "This change https://github.com/wikimedia/analytics-refinery/commit/b683424a16c9f9d4f6ff82dbde7fc3b633f382d0 should mean that new data can " [analytics/refinery] - 10https://gerrit.wikimedia.org/r/485894 (owner: 10Joal) [20:51:26] making a task, no worries, off to sleep for you! [20:51:38] Thanks milimetric - the rest of the thing looks ok? [20:51:58] yeah, trying to think how I can launch this job to make a good test [20:52:04] k [20:52:05] kind of thinking I should probably sqoop and do everyhting [20:52:31] For a small subset of wikis it shouldn't be too long [20:54:00] 10Analytics: Check logging table after next sqoop for log_user type - https://phabricator.wikimedia.org/T214437 (10Milimetric) [21:06:03] 10Analytics, 10Performance-Team (Radar): [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema - https://phabricator.wikimedia.org/T214384 (10Gilles) [22:18:39] milimetric: yt? [22:18:46] hi nuria yep [22:18:56] milimetric: I have a question about the comment: "I don't think it's possible for this table to be non-private unless mediawiki changes the schema, so let's remove this create." 
[22:19:02] milimetric: i do not get it at all
[22:19:22] milimetric: https://gerrit.wikimedia.org/r/#/c/476855/3/hive/mediawiki/history/create_mediawiki_actor_table.hql
[22:19:26] oh, just confusing wording, must've written that before I had my morning cookie nuria
[22:19:35] milimetric: jajaja
[22:19:35] joseph had two versions of the same table there
[22:19:45] he made one in the private folder and one in the normal folder
[22:19:56] he was thinking we would sqoop actor and comment from labs at some point
[22:20:05] but we can't, because it's slow there
[22:20:12] so it would always be private
[22:20:25] my comment should've said "this table will always be private so we don't need a public version"
[22:20:34] milimetric: i see, not "private", all those tables are private
[22:20:39] in cluster
[22:20:52] milimetric: but rather the "production" version
[22:20:52] mmmm, the ones from labs could technically be public
[22:21:10] but yes, by "private" we mean sqooped from production replicas
[22:21:15] milimetric: but we will not outsource the tables directly, rather the denormalized data
[22:21:18] if it doesn't have "private" it means it's sqooped from cloud
[22:21:48] milimetric: ok ya, private is not the best word there
[22:21:50] nuria: yes, we will publish denormalized. If it's useful to people though, we could put the raw cloud tables out, so people can query with presto
[22:22:12] milimetric: it will be a PR nightmare to explain that they are way behind the db replicas
[22:22:15] nuria: agreed, but it's hard to find one word that means all this, so we just went with private. Happy to rename, it wouldn't be hard, just didn't know of a better word
[22:22:37] milimetric: ok, let me think
[22:22:37] nuria: yeah, it could be confusing, maybe we wait until we get realtime replication and processing
[22:23:27] milimetric: how about "nonpublic"
[22:23:33] milimetric: is that horrible?
[22:23:45] "nonpublicdata" is the thing here
[22:23:47] ?
[22:25:52] nuria: how would that be different from private?
[22:26:14] milimetric: because all those tables are private, they are on the private cluster, right?
[22:26:23] milimetric: these tables however have nonpublic data
[22:26:57] milimetric: or maybe all other tables could be prefixed with "public"
[22:27:39] nuria: oh, I understand now the confusion
[22:27:58] nuria: we did think about this actually and we think the fix is to move these out of wmf_raw
[22:28:16] because the context in there is too confusing and you have situations like this
[22:28:22] milimetric: but wait, they will still be all private
[22:28:29] so my thought was to make a separate hive db, called mediawiki
[22:28:40] and another one called like mediawiki_nonpublic or whatever we want to call it
[22:28:44] and split up the tables that way
[22:29:04] or probably better like mediawiki_public and mediawiki, to your point
[22:29:41] that would differentiate both the final product and the tables sqooped from cloud from the rest of the private stuff on the cluster
[22:29:57] nuria: does that work? We were thinking we'd do that later, after we fix quality
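(A sketch of the database split milimetric floats above; the names are the ones from the discussion and nothing here is decided:)

    hive -e "CREATE DATABASE IF NOT EXISTS mediawiki_public COMMENT 'tables sqooped from the cloud (labs) replicas';
             CREATE DATABASE IF NOT EXISTS mediawiki COMMENT 'tables sqooped from the production replicas, non-public data';"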
[22:30:23] Analytics, good first bug: Productionize job for Global Innovation Index from Hadoop Geowiki data - https://phabricator.wikimedia.org/T190535 (Milimetric) p: Normal → Triage
[22:30:40] Analytics, good first bug: Productionize and run 2018 job for Global Innovation Index from Hadoop Geowiki data - https://phabricator.wikimedia.org/T190535 (Milimetric) p: Triage → High a: Milimetric
[22:32:28] milimetric: that seems best, agreed. In the meantime having those tables be called "private" is going to be very confusing. We are going to get a stream of questions about naming
[22:32:44] nuria: they've been named that for a year :)
[22:32:57] milimetric: REALLY?
[22:32:59] the actor and comment are just two new tables, there are others in there like cu_changes
[22:33:17] yeah, and nobody asked anything, so no worries, people don't pay attention to the guts of the cluster, like wmf_raw
[22:33:29] I agree we should change it, just no rush
[22:33:49] also, it's a big change and right now we have to focus on the join refactor
[22:33:55] (which I'm testing right now)
[22:34:05] milimetric: k
[23:02:13] Analytics, EventBus, Services: EventBus mediawiki extension should support multiple 'event service' endpoints - https://phabricator.wikimedia.org/T214446 (Ottomata) p: Triage → Normal
[23:16:42] Analytics, Analytics-EventLogging, EventBus, Core Platform Team (Modern Event Platform (TEC2)), and 2 others: Develop a library for JSON schema backwards incompatibility detection - https://phabricator.wikimedia.org/T206889 (Nuria) > json-schema-compatibility-checker +1 to this, this is how you w...
[23:58:49] Analytics, Analytics-EventLogging, Analytics-Kanban, MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), and 3 others: Spin out a tiny EventLogging RL module for lightweight logging - https://phabricator.wikimedia.org/T187207 (Nuria) @Krinkle: I tested in vagrant with navtiming sample 1/1 i...