[00:09:11] Analytics, Analytics-EventLogging, Contributors-Analysis, EventBus, and 2 others: Visualize page create events for all wikis - https://phabricator.wikimedia.org/T170850#3543470 (kaldari) FWIW, there's a list of all the wiki databases at https://phabricator.wikimedia.org/source/mediawiki-config/br...
[03:25:43] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3543672 (Shilad) @herron I am having some trouble logging in. I can get to bastion but not beyond. I'm suspicious that the key I gav...
[04:39:27] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (jeremyb) that's maybe exactly the problem. your debug log says > debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH...
[04:46:13] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (Dzahn) I can confirm this. The reason is the key type DSS. From auth.log on stat1005: 85335 Aug 23 03:10:05 stat1005 sshd[...
[06:45:19] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3482327 (MoritzMuehlenhoff) Please don't add new DSA keys, we're down to two keys of that kind and I'm planning to remove server-sid...
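A note on the key-type failures in the T171988 thread above: OpenSSH 7.0 and later disable ssh-dss (DSA) user keys by default, which is why the key was rejected at stat1005. A minimal sketch of the usual remedy, inspecting the old key and generating an Ed25519 replacement (file names and the comment string here are illustrative, not taken from the ticket):

    # Show fingerprint, size and type of the existing public key:
    ssh-keygen -l -f ~/.ssh/id_dsa.pub
    # Generate an Ed25519 keypair to replace the deprecated DSA one:
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "shell-username"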
[07:58:48] (PS1) Joal: Correct mediawiki_history hive table create script [analytics/refinery] - https://gerrit.wikimedia.org/r/373249
[08:02:57] (PS3) Joal: Add tagging as part of webrequest refine process [analytics/refinery] - https://gerrit.wikimedia.org/r/367940 (https://phabricator.wikimedia.org/T171760) (owner: Nuria)
[08:03:31] (CR) Joal: [V: 2 C: 2] "bug correction, self merging." [analytics/refinery] - https://gerrit.wikimedia.org/r/373249 (owner: Joal)
[08:04:15] (CR) Joal: [V: 2 C: 2] "Merging for deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/367940 (https://phabricator.wikimedia.org/T171760) (owner: Nuria)
[08:07:32] (PS2) Joal: Add project_family to webrequest normalized_host [analytics/refinery] - https://gerrit.wikimedia.org/r/362160 (https://phabricator.wikimedia.org/T168874)
[08:11:37] (PS2) Joal: Rename project_class to project_family [analytics/refinery/source] - https://gerrit.wikimedia.org/r/362159 (https://phabricator.wikimedia.org/T168874)
[08:12:14] (CR) Joal: [C: 2] "Self merging for deploy." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/362159 (https://phabricator.wikimedia.org/T168874) (owner: Joal)
[08:14:28] (PS3) Joal: Add project_family to webrequest normalized_host [analytics/refinery] - https://gerrit.wikimedia.org/r/362160 (https://phabricator.wikimedia.org/T168874)
[08:14:48] (CR) Joal: [V: 2 C: 2] "Self merging before deploy." [analytics/refinery] - https://gerrit.wikimedia.org/r/362160 (https://phabricator.wikimedia.org/T168874) (owner: Joal)
[08:15:51] (Merged) jenkins-bot: Rename project_class to project_family [analytics/refinery/source] - https://gerrit.wikimedia.org/r/362159 (https://phabricator.wikimedia.org/T168874) (owner: Joal)
[08:20:13] (PS1) Joal: Add v0.0.50 to changelog.md for deployment [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373254
[08:20:52] (CR) Joal: [V: 2 C: 2] "Self merging for deploy." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373254 (owner: Joal)
[08:23:06] Analytics-Kanban: Add time_to_user_next_edit and time_to_page_next_edit in Mediawiki Denormalized History - https://phabricator.wikimedia.org/T161896#3146655 (JAllemandou) Double checked values are present, everything seems good :)
[08:25:39] !log Deploying refinery-source v0.0.50 using jenkins
[08:25:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:01:00] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Investigate use-cases for delayed job executions - https://phabricator.wikimedia.org/T172832#3543950 (phuedx) IIRC Readers Web doesn't actively maintain any extensions or skins that make use of delayed job executions. We had cons...
[09:59:24] !log Kill oozie webrequest-load bundle for restart after deploy
[09:59:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:59:59] !log Deploying refinery
[10:00:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:06:41] !log Deploying refinery onto hdfs
[10:06:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:09:58] !log Alter webrequest table before restarting oozie load bundle
[10:10:06] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:12:39] !log Restart oozie webrequest-load bundle after deploy and updates
[10:12:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:04:27] !log Update wmf.wdqs_extract table for normalized_host update
[11:04:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:07:14] (PS1) Joal: Update wdqs_extract hive table create script [analytics/refinery] - https://gerrit.wikimedia.org/r/373268
[11:07:43] (CR) Joal: [V: 2 C: 2] "Already changed in production, merging patch for code correctness." [analytics/refinery] - https://gerrit.wikimedia.org/r/373268 (owner: Joal)
[11:56:53] Looks like new deployed code is working good now that I corrected wdqs_extract
[11:56:58] Taking a break :)
[14:46:50] Analytics-Kanban, Analytics-Wikistats: Error handling - https://phabricator.wikimedia.org/T171487#3544990 (fdans)
[14:47:25] ottomata: Hello :)
[14:47:36] joal hallooo
[14:47:53] ottomata: do we go for a fast pre-standup ops round?
[14:54:32] oh sure!
[14:54:35] :)
[14:54:35] coming
[15:14:40] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3545106 (Cmjohnson) Open>Resolved Received and replaced the raid controller! A million times better and it's working fine no...
[15:14:42] Analytics, Analytics-Cluster, Operations, ops-eqiad, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3545108 (Cmjohnson)
[15:15:27] Analytics, Analytics-Cluster, Operations, ops-eqiad, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3352337 (Cmjohnson) a: Cmjohnson>RobH the issue with 1004 has been resolved assigning to @robh to do installs.
[15:41:15] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3545260 (Ottomata) Amazing! Thank you.
[15:48:01] ottomata: nice queue made it to the cluster :)
[15:50:44] fdans: Heya, do you mind if I let you go to the anti-harassment meeting?
[15:52:41] joal: I just realised we had it! No problem, I got it :)
[15:52:48] Thanks fdans
[16:11:11] yeah! it just showed up!
[16:11:46] however I think oozie removal will require a restart - it can wait for the next needed restart :)
[16:11:51] Thanks for that ottomata
[16:24:56] ya
[16:52:10] hey ottomata - anything to discuss before the meeting?
[16:52:43] hmm, not really, FJ asked to have it, i think he wants to just start syncing up on what to put in the presentation, so it'll just be good to have your knowledge of how we are using druid there for that
[16:52:58] okey :)
[17:19:13] ottomata: im working on your kafka jumbo installs today =]
[17:19:22] just starting actually
[17:24:10] yeehawww!
[17:24:14] thanks robh
[17:24:27] welcome =]
[17:26:05] ottomata: so these have 12 * 4TB + 2 * 1TB
[17:26:11] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3545820 (Krinkle)
[17:26:11] I assume we want a raid1 of the 1tb disks for the OS
[17:26:17] and then a raid10 of the 12 4TB disks for /srv?
[17:26:54] thats what we had in the discussion on the purchase
[17:26:59] but i wanted to confirm before install
[17:27:10] We have 12 x 2TB disks now, and from an eyeball of one box, it looks like we are using on average 30-40% of our space. I expect space usage on these boxes to grow, so I suppose if we can get drives larger than 2TB, that would be helpful. Maybe 3TB ones?
[17:27:10] If we did decide to go RAID10, we'd need 12 x 4TB at least. I wouldn't want to reduce our total storage capacity. But, I'm pretty sure we don't want JBOD.
[17:27:13] (from that thread)
[17:35:05] robh
[17:35:05] correcy
[17:35:07] correct
[17:35:27] cool, i dont see a partman recipe for just that, which is shocking...
[17:35:33] your most recent response here
[17:35:33] https://phabricator.wikimedia.org/T167992#3493822
[17:35:35] most of them assume the 12 * whatever are in jbod
[17:35:37] is good
[17:35:52] cool
[17:36:06] yeah ill have to brush off partman skills and apply them today, good times ;D
[17:36:28] only used approximately every 6 months, just long enough to forget them all...
[17:51:15] is anything wrong with the analytics cluster? my hive query runs then I just get "Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask"
[17:54:27] Unless I'm just not seeing something obviously wrong with the query? https://www.irccloud.com/pastebin/VgVbt2N2/internal_kartotherian_usage.hql
[17:58:49] weird bearloga :(
[17:58:58] hadoop seems fine
[18:00:08] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3546452 (herron) Hi @Shilad, your ssh key has been updated. Are you able to log in? Aug 23 17:54:50 stat1005 puppet-agent[4323]:...
[18:05:27] Analytics, Analytics-EventLogging, Contributors-Analysis, EventBus, and 2 others: Visualize page create events for all wikis - https://phabricator.wikimedia.org/T170850#3546487 (Nettrom) @mforns : Thanks much for your help with this! I've set up the queries so they return two columns, with the se...
[18:06:30] bearloga: any luck?
[18:09:52] bearloga: can you put the results of `yarn logs -applicationId application_1498042433999_215362 > failed_hive.log` somewhere?
[18:10:07] bearloga: just running that in your home dir on stat1005 or something where i can glance at the verbose logs
[18:10:23] ebernhardson: Thanks for helping :)
[18:10:34] gone for tonight lads
[18:13:20] bearloga: actually i found the logs in another interface: Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveArrayInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
[18:13:25] i'm sure that means a lot to you :P
[18:13:41] but it basically means something that was expecting a primitive type, like a string or number, received an array
[18:13:45] ebernhardson: yup, that was the specific error message
[18:14:58] ebernhardson: no idea why that would be the case for my query, though :\
[18:15:21] bearloga: the `yarn logs` command above should hopefully give something more detailed though, because i can only seem to see the logs from the job manager, but not the individual workers
[18:15:40] ebernhardson: ah, okay. lemme get that for ya :)
[18:16:18] joal: is it possible your recent change of the webrequest table for tags is doing this?
[18:17:25] ebernhardson: output of the `yarn logs` command in stat1005:/home/bearloga/failed_hive.log thanks for looking into it! :) cc ottomata
[18:19:37] hmm, so annoyingly while spark reports the shape of data files, hive doesn't :S might be able to find something though
[18:22:39] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 4 others: Investigate use-cases for delayed job executions - https://phabricator.wikimedia.org/T172832#3546556 (dr0ptp4kt) @MarkTraceur, adding for your review as well in case anything pertinent for Multimedia.
[18:23:11] bearloga: def something with query
[18:23:20] i'm paring it down and getting same thing with
[18:23:24] SELECT
[18:23:24] DATE(CONCAT(year, '-', month, '-', day)) AS `date`
[18:23:24] FROM wmf.webrequest
[18:23:24] WHERE
[18:23:24] webrequest_source = 'upload'
[18:23:24] AND year = 2017 AND month = 8
[18:23:24] AND day = 21 AND hour = 12
[18:23:25] limit 5;
[18:23:43] or hm
[18:23:46] not sure yet
[18:24:08] actually, the query is getting too simple
[18:24:11] something else must be wrong
[18:24:34] ottomata: tried it without `DATE(CONCAT(year, '-', month, '-', day)) AS date` just to eliminate some possibilities and same error
[18:24:39] Analytics, Analytics-EventLogging, Contributors-Analysis, EventBus, and 2 others: Visualize page create events for all wikis - https://phabricator.wikimedia.org/T170850#3546562 (mforns) @Nettrom Cool! Yes, feel free to add any links or details that you find interesting to the docs! And yes, you...
[18:25:18] yeah
[18:25:19] joal:
[18:25:28] i think whatever changes have been made on the webrequest table today
[18:25:28] make it not able to read old data
[18:25:31] ottomata ebernhardson: so the funny thing is the essentially-same version of the query was working totally fine yesterday afternoon
[18:25:43] bearloga, these simple queries work on more recent data
[18:25:53] which would have been written with the new altered webrequest schema
[18:26:04] joal deployed changes today that added tags and also a new field to normalized_host
[18:26:12] seeing as this is org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveArrayInspector
[18:26:16] i'm guessing its the tags field
[18:26:42] hmmm
[18:26:44] bummer
[18:26:51] i think jo al is gone
[18:26:56] trying to think of what to do...
[18:27:01] ottomata: yeah he signed off for the night
[18:27:28] indeed it's complaining about trying to use an array as a primitive, and it looks like for arrays in the schema its just tags and normalized_host.qualifiers
[18:28:16] never seen this before
[18:28:25] in the past, when we've altered the table schema
[18:28:31] the thing that strikes me odd -- and perhaps it's because I don't know how some of the mapreduce hive stuff works behind the scenes -- is the query doesn't reference either of those at all
[18:28:32] missing fields in old data were just handled as null
[18:28:46] yeah, its because the hive table itself has had its schema altered
[18:28:58] but, the old parquet data (which also has a schema) in the older partitions has not
[18:29:06] bearloga: that is odd, especially because parquet is columnar which means it shouldn't even try to look at unreferenced fields
[18:29:15] ebernhardson: right?!?
[18:29:15] right
[18:29:17] usually it is fine
[18:37:03] backparsing chan
[18:37:26] joal, i'm testing this, creating a webrequest table in my db with a partition in old data and in new data
[18:37:31] then will alter the table and remove tags field
[18:37:33] and see how it goes
[18:38:20] k ottomata
[18:38:25] I'm sorry bearloga :(
[18:38:59] HMMMM
[18:39:03] weird
[18:39:09] joal: with a newly created table, i can't reproduce....
[18:39:15] i wonder if it was the alter statement that did it?
[18:39:20] going to try that too
[18:39:31] ottomata: we can always drop / recreate
[18:40:17] let me verify
[18:41:49] joal: do you have the exact alter command you ran?
[18:41:59] ottomata: I can find it
[18:42:46] ALTER TABLE wmf.webrequest CHANGE `normalized_host` `normalized_host` struct<project_family: String, project: String, qualifiers: array<String>, tld: String> COMMENT 'struct containing project_family (such as wikipedia or wikidata for instance), project (such as en or commons), qualifiers (a list of in-between values, such as m and/or zero) and tld (org most often)' AFTER
[18:42:52] `referer_class`;
[18:42:58] ALTER TABLE wmf.webrequest ADD COLUMNS (`tags` array<String> COMMENT 'List containing tags qualifying the request, ex: [portal, wikidata]. Will be used to split webrequest into smaller subsets.')
[18:44:06] GRR
[18:44:08] can't reproduce
[18:44:09] HMM
[18:44:15] ottomata: now is a right time if we want to drop / recreate
[18:44:23] But it's weird :(
[18:44:46] i did the alter you just did (just for tags though...)
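A sketch of the reproduction setup ottomata describes above (a private copy of the table with one pre-change and one post-change partition mapped onto the existing data). The table name otto.wrfix1 appears later in the log; the partition values and the exact DDL are illustrative, and he built his copy from the pre-change CREATE statement (omitted here for length) before applying the two ALTERs quoted above:

    -- otto.wrfix1 created from the pre-change webrequest DDL, then two
    -- partitions pointed at existing webrequest data directories:
    ALTER TABLE otto.wrfix1 ADD
      PARTITION (webrequest_source='text', year=2017, month=8, day=22, hour=0)
        LOCATION '/wmf/data/wmf/webrequest/webrequest_source=text/year=2017/month=8/day=22/hour=0'
      PARTITION (webrequest_source='text', year=2017, month=8, day=23, hour=10)
        LOCATION '/wmf/data/wmf/webrequest/webrequest_source=text/year=2017/month=8/day=23/hour=10';
    -- apply the ALTERs quoted above to otto.wrfix1, then query each partition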
[18:44:47] hmm
[18:44:50] ok lemme try again
[18:44:53] without the project_family stuff
[18:44:54] and see
[18:44:58] ottomata: there is one job to wait for (webrequest-load text)
[18:45:55] ottomata: I think it comes from the struct
[18:46:21] ottomata: It tries to read an array of string (qualifiers) where there only is a string
[18:46:39] ya, but if this didn't work
[18:46:49] then none of my eventlogging refine stuff would work!
[18:46:49] :)
[18:47:16] ottomata: I'd like to try adding it at the end, not middle (my bad on this)
[18:47:17] yup it is the struct
[18:47:18] can reproduce
[18:47:27] you think its because you added project_family in middle?
[18:47:44] hmm, but recreating the table was fine???
[18:48:47] joal
[18:48:48] confirm
[18:48:54] right
[18:48:56] adding project_family at end of struct still works on old data
[18:49:13] MWARF
[18:49:18] joal
[18:50:17] ottomata: correct, redeploy, rerun? or rollback, rerun?
[18:50:37] i think we can just run a simple alter
[18:50:39] to change it
[18:50:41] i'm testing now
[18:50:48] i altered normalized_host
[18:50:56] put project_family at end
[18:51:05] old data gives NULL
[18:51:10] data will break on new data
[18:51:16] hmmm
[18:51:16] yes
[18:51:18] new data is null
[18:51:23] but it isn't broken
[18:51:27] and project_class is still there
[18:51:28] hmmmm
[18:51:37] and probably wrong for tld or qualifiers, no?
[18:51:41] joal: new deployed code is using project_family, right?
[18:51:46] tld?
[18:51:54] other struct fields
[18:51:59] tld ok
[18:52:01] new deployed code is not
[18:52:04] qualifiers?
[18:52:32] hmm, i'm only looking at a few records
[18:52:35] so, on new data, hive manages to read tld and qualifiers in struct in a reordered struct ?
[18:52:36] but it returns empty array
[18:52:41] so far ya
[18:52:43] right
[18:52:47] sounds good then
[18:52:50] but
[18:52:56] yeah, the project_family is not populated correctly
[18:53:02] on the existing new data
[18:53:03] joal, we could alter
[18:53:06] and rerun jobs from today?
[18:53:40] we need to change java code as well (currently doing it)
[18:53:51] why?
[18:53:58] joal, going to pause refine job
[18:54:00] load job
[18:54:01] cause the order of the struct changes
[18:54:08] good idea
[18:54:10] does that matter for the code?
[18:54:18] it dies
[18:54:20] does
[18:54:31] ok
[18:54:50] ottomata: for hive, a struct is an array
[18:54:54] so the index matters
[18:55:15] this is why I find it weird that we can correctly read new data with wrong ordering
[18:56:30] joal the struct is not an array for parquet though, no?
[18:56:42] shouldn't have be able to use the struct field name to pull the data out of parquet?
[18:56:44] hive*
[18:56:44] I don't think it is, but not sure
[18:57:00] possible
[18:57:07] but then, why break on old data then?
[18:57:17] yeah
[18:57:18] hmmm
[18:57:21] that is weird for sure
[18:57:21] dunno
[18:57:28] ya but i guess this is why my json refine stuff works
[18:57:35] it always adds new struct fields at the end of the struct
[18:58:03] ok, you get code up and deployed, then we alter table, then restart jobs?
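The append-at-end rule ottomata lands on above is the general pattern for evolving Parquet-backed Hive schemas: existing fields keep their positions, new struct members go last, and old files simply return NULL for them. A sketch of that variant against the otto.wrfix1 test table (a hypothetical statement matching what the exchange describes, assuming the pre-change struct was project_class, project, qualifiers, tld; the exact read-path mechanism stayed unclear in the discussion):

    -- Safe evolution: keep project_class where old files expect it and
    -- append the new field, rather than renaming/reordering at the front:
    ALTER TABLE otto.wrfix1 CHANGE `normalized_host` `normalized_host`
      struct<project_class: String, project: String, qualifiers: array<String>,
             tld: String, project_family: String>;
    -- Old partitions then read fine, with NULL project_family; the
    -- front-reordered variant instead hit the ParquetHiveArrayInspector ->
    -- PrimitiveObjectInspector ClassCastException seen earlier.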
[18:58:11] joal: just in case, should we restart oozie jobs from earlier today
[18:58:14] when you restarted them?
[18:58:20] start time of 8 23 09
[18:58:21] ?
[18:58:26] today, correct
[18:58:30] 2017-08-23T09:00
[18:58:31] ?
[18:58:48] it might be working, but as you say that is weird
[18:58:55] (PS1) Joal: Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341
[18:58:57] so we should just re-run those webrequest load jobs, ya?
[18:59:00] ottomata: --^
[18:59:22] crazy
[18:59:23] I think we need to redeploy java before
[18:59:33] (CR) Ottomata: [C: 1] Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[18:59:34] merge away
[18:59:41] oh ya we need new version eh?
[19:00:03] ottomata: yes :(
[19:00:07] (CR) Joal: [V: 2 C: 2] "Self merging for urgent deploy" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[19:00:48] Arf, too fast - sorry, will concentrate
[19:02:16] i wonder if it only happens on certain hours, i just looked back at my oozie jobs and i have 1 failed hour, with the exact same class cast exception
[19:02:45] (CR) jerkins-bot: [V: -1] Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[19:02:56] That's weird :(
[19:03:14] ebernhardson: other jobs have worked so far, so I didn't notice - I'm very sorry
[19:03:25] ebernhardson: it will only happen on hours before 09:00 today
[19:03:29] utc
[19:03:43] (CR) jerkins-bot: [V: -1] Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[19:06:04] (PS2) Joal: Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341
[19:06:35] should be better
[19:06:42] waiting for jenkins validation
[19:07:44] k
[19:08:40] (PS1) Joal: Bump changelog to v0.0.51 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373344
[19:10:24] joal: I'm just glad this problem was discovered sooner rather than later and I'm sorry you had to be pulled back into work after signing off :\
[19:10:58] don't worry bearloga - I'm glad as well it's been discovered now
[19:11:02] thanks again bearloga :)
[19:12:24] (CR) Joal: [C: 2] Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[19:12:59] (PS1) Joal: Correct webrequest host normalization bug [analytics/refinery] - https://gerrit.wikimedia.org/r/373345
[19:13:13] ottomata: starting a deploy
[19:14:25] ok joal
[19:15:51] (Merged) jenkins-bot: Correct host normalization udf [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373341 (owner: Joal)
[19:16:14] ottomata: +1 on https://gerrit.wikimedia.org/r/#/c/373344/1 ?
[19:16:53] (CR) Ottomata: [C: 1] Bump changelog to v0.0.51 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373344 (owner: Joal)
[19:16:54] done
[19:17:00] thanks
[19:17:10] (CR) Joal: [C: 2] Bump changelog to v0.0.51 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373344 (owner: Joal)
[19:17:42] (CR) Joal: [V: 2 C: 2] Bump changelog to v0.0.51 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373344 (owner: Joal)
[19:21:23] joal: lemme know if/when i can help
[19:21:35] ottomata: currently waiting for jenkins
[19:22:01] ottomata: I'm going to kill the loading job you suspended, and apply the ALTER TABLES while waiting - makes sense?
[19:24:43] ok
[19:24:44] yes
[19:24:47] go ahead
[19:24:55] Doing
[19:25:14] !log Kill oozie webrequest-load bundle for redeploy after bug correction
[19:25:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:26:21] !log Alter wmf.webrequest and wmf.wdqs_extract tables to correct bug
[19:26:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:29:00] bearloga: your request is back in the game :)
[19:30:14] joal: trying now :)
[19:31:38] ottomata: tried to access project_family on new data while being at end: error as well (expected)
[19:34:06] joal: eh?
[19:34:11] oh new data already created today
[19:34:12] you mean
[19:34:12] yeah
[19:34:12] ok
[19:34:18] dunno why it worked for me (well, didn't work, it gave nulls)
[19:34:33] for me it actually breaks :)
[19:34:45] joal: query works for me now! :)
[19:34:59] Yay bearloga :)
[19:35:06] * joal is happy :)
[19:36:19] (PS2) Joal: Correct webrequest host normalization bug [analytics/refinery] - https://gerrit.wikimedia.org/r/373345
[19:36:26] ottomata: --^ real quick?
[19:36:48] !log Deployed refinery-source using jenkins for bug correction
[19:36:49] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:39:49] (CR) Ottomata: [C: 1] Correct webrequest host normalization bug [analytics/refinery] - https://gerrit.wikimedia.org/r/373345 (owner: Joal)
[19:39:52] +1 joal
[19:40:02] i guess its ok to keep record version the same since we are going to overwrite the data we created today
[19:40:03] ya?
[19:40:04] Great, merging + deploying, thanks ottomata
[19:40:11] I think so ottomata
[19:40:42] (CR) Joal: [V: 2 C: 2] "Self merging for urgent deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/373345 (owner: Joal)
[19:41:22] !log Deploying refinery from tin
[19:41:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:46:26] !log Deploy refinery onto hdfs
[19:46:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:50:02] !log restart oozie webrequest-load bundle after bug correction
[19:50:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:52:06] ok ottomata, taking a break for dinner, will be back after to double check
[19:54:54] ok great
[19:55:01] i will check too after a job runs
[20:02:54] ooo, joal do we need to restart the wikidata-wdqs_extract stuff?
[20:04:10] uh oh joal
[20:04:10] Failed with exception java.io.IOException:java.lang.RuntimeException: Hive internal error: conversion of string to array<string> not supported yet.
[20:09:25] Analytics-Kanban, Patch-For-Review: Troubleshoot Wikimetrics "magic button" - https://phabricator.wikimedia.org/T173585#3546923 (10mforns) We found out what the cause of the issue is: Wikimetrics is connecting to the same labsdb host for all databases. And it creates a connection for each database it needs...
[20:11:54] HMMM
[20:11:57] dunno, it works for me
[20:12:00] why is it broken
[20:12:27] ok, its just broken on new data
[20:12:34] but, if i map my own table to new data
[20:12:35] it works
[20:13:15] its tags
[20:13:19] not normalized_host this time
[20:27:33] ottomata: so i may not have those done today, im now in partman troubleshooting time
[20:27:48] it seems we have no partman recipe that expects two different non-sw-raid disks presented
[20:27:52] its ok!
[20:27:54] and to set / on one and /srv on the other
[20:27:55] i'm def not going to touch them today
[20:27:59] almost end of my day
[20:28:05] also kafka-jumbo1001 has some network issue
[20:28:12] hm
[20:28:14] where it hits dhcp but doesnt get a free lease but other systems work
[20:28:14] ok thanks for trying :)
[20:28:22] i'll document it on its own task for troubleshooting
[20:28:23] that sounds annoyyying
[20:28:34] yeah, its mind boggling, i see dhcpd request hit the install system
[20:28:37] from the right subnet
[20:28:48] the system has a dns entry in the subnet that the install server can dig and host against both the hostname and ip
[20:28:52] so it makes 0 sense.
[20:29:26] its setup identically to kafka-jumbo1002, which works fine. except 1001 had some odd ass network port setup issue that cropped up and went away
[20:29:32] but we'll figure it out eventually =P
[20:31:55] hey ottomata
[20:32:07] joal hey
[20:32:10] i don't know what's going on
[20:32:18] i can't reproduce, i did the same alter procedure you did
[20:32:20] on my table
[20:32:21] and it works fine
[20:32:29] but, now we can't select NEW data :p
[20:32:32] but old data works
[20:32:52] joal
[20:32:56] Failed with exception java.io.IOException:java.lang.RuntimeException: Hive internal error: conversion of string to array<string> not supported yet.
[20:32:57] i think that's why it fails
[20:33:32] ottomata: This is related to normalized host, sure
[20:34:03] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3547014 (Shilad) Yes! I updated [[ https://wikitech.wikimedia.org/wiki/Help:SSH | Help:SSH ]] to indicate that DSA is being phased o...
[20:34:04] no, i think it is tags
[20:34:06] no?
[20:34:11] hmm, i could be wrong.
[20:34:12] I don't think so
[20:34:24] but, in any case
[20:34:36] its working on my otto.wrfix1 table
[20:34:39] i created table as it was before your changes
[20:34:43] ottomata: Got that exact same error on normalized host on other partitions
[20:34:57] then i added 2 partitions, one old, one new
[20:35:01] then i altered normalized host
[20:35:06] it worked both old and new
[20:35:08] then i added tags
[20:35:12] worked on both old and new
[20:35:21] joal, i wonder if we should try recreating the table :(
[20:35:51] when you've done that before, how did you re-add partitions? manually? i'm preparing a new webrequest table now in my db with the same partitions (scripting that)
[20:35:55] and seeing if it has the same problem
[20:36:31] I didn't do anything with partition
[20:36:33] Analytics, Research: productionize ClickStream dataset - https://phabricator.wikimedia.org/T158972#3547018 (herron)
[20:36:37] Analytics, Operations, Ops-Access-Requests, Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3547016 (herron) Open>Resolved Great! Glad to hear it!
[20:36:40] joal this time, i know
[20:36:42] because you altered
[20:36:44] but in the past
[20:36:50] we altered too i guess?
[20:36:50] hm
[20:38:16] as you say: hm
[20:38:29] ottomata: trying to drop / reload a partition manually
[20:38:39] ok
[20:39:33] joal: do you mind if i pause load job again?
[20:40:10] ottomata: dropping/recreating works --> So, when hive overwrites a partition, it does so for data only, not for schema !!!
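A sketch of the drop-and-recreate fix joal lands on above (partition values and the HDFS location are illustrative, taken from the hours mentioned later; re-adding a partition registers it against the table's current schema, whereas the overwrite path only rewrote the data files):

    ALTER TABLE wmf.webrequest DROP IF EXISTS
      PARTITION (webrequest_source='misc', year=2017, month=8, day=23, hour=11);
    ALTER TABLE wmf.webrequest ADD
      PARTITION (webrequest_source='misc', year=2017, month=8, day=23, hour=11)
      LOCATION '/wmf/data/wmf/webrequest/webrequest_source=misc/year=2017/month=8/day=23/hour=11';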
[20:40:47] ottomata: no need to pause - We just need to manually drop the partitions and recreate them (when already present)
[20:40:54] ottomata: doing so now for misc
[20:40:56] AAHHHHH
[20:40:57] great
[20:41:01] ok phew
[20:41:07] Indeed !
[20:41:08] nice find
[20:41:10] phew phew
[20:41:19] Man, I'm sorry for that mess
[20:42:59] no its ok, its weird!
[20:43:37] Rerunning wdqs failed ones
[20:45:50] wdqs jobs succeeded - We're safe I think
[20:46:02] great
[20:46:03] phew
[20:46:05] nice joal
[20:47:47] I'm going to delete the partitions on upload and text that still need to be worked
[20:48:56] Done
[20:49:29] We'll need to drop/recreate hours 11 and 12 because they are currently running, and we should be good from 13 onward
[20:55:27] ok great joal
[20:56:45] (PS1) Nettrom: Add page creation configuration and queries [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/373373
[21:00:47] joal, ottomata: (reading backscroll) once the webrequest / normalized_host issue is fixed, could you post a summary to Analytics-l? in particular it would be good to know when the issue started to occur, in case other queries need to be checked and perhaps re-run
[21:00:59] also, did it affect any derived tables? https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Other_Hive_tables_that_are_based_on_Webrequest
[21:01:42] HaeB for sure, if joal doesn't get to it tomorrow, i'll add it to the https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Changes_and_known_problems_since_2015-03-04 and send email
[21:01:58] great thanks
[21:01:58] HaeB: it shouldn't have affected derived tables, but it may have caused errors in jobs that create the derived tables
[21:02:08] that's what i mean ;)
[21:02:11] like this wdqs_extract job
[21:02:18] those jobs will probably just need re-run
[21:02:21] i.e. affect data in those derived tables
[21:03:03] if you go to https://hue.wikimedia.org/oozie/list_oozie_workflows/
[21:03:07] and filter for Error
[21:03:25] i think that the killed jobs there from today were killed because of this
[21:03:35] joal is def taking care of the wdqs_extract ones
[21:03:47] i'll check back in tomorrow on the others to see if he hasn't also done those
[21:04:06] joal, i've gotta head out for a bit, send me an email if you want me to do or check anything when I get back
[21:09:17] HaeB, ottomata: Given the type of error the issue generated, there should not be data corruption, but some jobs might need to be rerun because of failure
[21:10:02] We have alarms on all (I think) prod jobs, so we already know for the ones that have failed (wdqs only AFAIK)
[21:10:26] I'll post an email tomorrow when everything is confirmed ok and things have settled down
[21:11:04] Analytics, Analytics-Cluster, Operations, ops-eqiad, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3547119 (RobH) Ok, kafka-jumbo1001 has odd issues. It is confirmed to have the correct MAC address in dhcp, as well as dns is righ...
[21:11:51] joal: ok cool
[21:12:02] (also, welcome back, belatedly ;)
[21:12:19] :)
[21:12:33] Thanks HaeB :)
[21:12:44] joal ottomata: thank you for rapid identification & fixing of problem!
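For the triage ottomata describes above (Hue's workflow list filtered for Error), the same check can be done from the Oozie CLI; a sketch, assuming OOZIE_URL is set in the shell environment on a cluster client node:

    # List recent workflows that were killed, to find jobs needing a re-run:
    oozie jobs -oozie $OOZIE_URL -jobtype wf -filter "status=KILLED" -len 50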
[21:13:32] Thank you bearloga as well for letting us know
[21:13:53] Analytics, Analytics-EventLogging, Contributors-Analysis, EventBus, and 2 others: Visualize page create events for all wikis - https://phabricator.wikimedia.org/T170850#3547131 (Nettrom) @mforns Patch submitted (linked below), and I added you as a reviewer. First time working with Gerrit, hopeful...