[05:41:41] 10Analytics: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) [06:52:48] goood morning [07:00:51] all right I am deploying analytics-hive's cred to an-coord1001 [07:01:05] update all the clients and then rebooting the stat100x boxes for kernel upgrades [07:07:34] ah interesting, all the refine jobs were pointing to an-coord1001 [07:07:41] so before restarting I'll wait a bit [07:08:22] also hive-site.xml in hdfs has just been updated [07:08:28] let's see if any error pops up [07:09:36] I am testing beeline and spark on stat1004, all good [07:10:20] show schemas in the presto cli works [07:23:37] !log move all analytics clients (spark refine, stat100x, hive-site.xml on hdfs, etc..) to analytics-hive.eqiad.wmnet [07:23:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:27:49] !log reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades [07:27:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:04:51] \o/ [08:05:23] I have a work medical appointment this morning, will be here once done [08:05:33] joal: bonjour! Let's failover when you are back :) [08:05:36] I hope it all works as expected elukey [08:05:44] Yessir! [08:05:50] and elukey : Hi2~!! [08:05:54] so far nothing exploded :D [08:06:03] great [09:00:40] * elukey coffee! [09:53:03] Back [09:55:14] elukey: failover?
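The hive-site.xml swap described above boils down to pointing every client's metastore URI at the analytics-hive CNAME instead of a concrete coordinator host. A minimal sketch of verifying that, assuming the standard `hive.metastore.uris` property; the file contents below are illustrative, not the real cluster config:

```python
# Sketch: what the hive-site.xml move to the analytics-hive CNAME looks like.
# hive.metastore.uris is a standard Hive property; the values shown here are
# illustrative, not the actual WMF cluster configuration.
import xml.etree.ElementTree as ET

HIVE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://analytics-hive.eqiad.wmnet:9083</value>
  </property>
</configuration>
"""

def metastore_host(hive_site_xml: str) -> str:
    """Return the metastore host configured in a hive-site.xml document."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == "hive.metastore.uris":
            uri = prop.findtext("value")  # looks like thrift://host:port
            return uri.split("//", 1)[1].rsplit(":", 1)[0]
    raise KeyError("hive.metastore.uris not set")

print(metastore_host(HIVE_SITE))  # analytics-hive.eqiad.wmnet
```

A quick check like this is what the beeline/spark/presto smoke tests above confirm end to end.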
[09:55:50] joal: lemme restart hive on an-coord1001 first [09:56:03] !log restart hive daemons on an-coord1001 to pick up analytics-hive settings [09:56:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:00:07] PROBLEM - Hive Metastore on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [10:01:06] yep some issues [10:01:17] Mwarf :( [10:02:00] weird [10:03:34] so in theory nothing should fail now [10:03:40] I mean job-wise etc.. [10:03:59] I have some problems with mysql settings, trying to fix them [10:04:06] ack! [10:14:13] RECOVERY - Hive Metastore on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hive.metastore.HiveMetaStore https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive [10:15:03] this is manually fixed, I am sending a follow up patch [10:18:11] all right we should be ok [10:20:36] joal: https://gerrit.wikimedia.org/r/c/operations/dns/+/651456/ [10:20:47] reading elukey [10:21:22] elukey: can you explain please? [10:21:30] I'm sorry I don't understand :) [10:21:41] joal: it is the failover patch, the CNAME is directed to an-coord1001 [10:21:54] it has a 300s TTL so it should be quick [10:22:30] Ack [10:24:30] joal: ok to go? [10:24:38] Yes! [10:25:21] !log failover analytics-hive.eqiad.wmnet to an-coord1001 [10:25:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:28:50] I already see map-reduce jobs launched [10:28:59] on an-coord1001 [10:29:01] \o/ [10:29:37] if this works it is a good pre-holidays gift :) [10:30:12] Yes :) [10:33:09] ah ok presto doesn't work but it is my fault, patching [10:36:53] !log restart presto coordinator to pick up analytics-hive settings [10:36:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:41:58] working now!
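Because the failover is just a CNAME flip with a 300s TTL, clients converge within one TTL at worst. A hedged sketch of a wait-for-convergence helper; the resolver is injected so the logic is testable without real DNS, and in real use it would be a wrapper around something like socket.gethostbyname for analytics-hive.eqiad.wmnet:

```python
# Sketch of waiting for a low-TTL CNAME flip to converge. The resolver is
# any zero-arg callable (injected so this runs without touching real DNS).
import time

def wait_for_cname(resolve, expected, ttl=300, interval=10,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll resolve() until it returns `expected`; give up after one TTL."""
    deadline = clock() + ttl
    while True:
        if resolve() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Illustrative run with a fake resolver that flips on the third poll:
answers = iter(["an-coord1002", "an-coord1002", "an-coord1001"])
print(wait_for_cname(lambda: next(answers), "an-coord1001",
                     sleep=lambda s: None))  # True
```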
[10:42:04] spark2-shell also works afaics [10:42:30] joal: lemme know if you see anything weird [10:42:35] buuut it looks good so far :) [10:42:50] \o/ [10:46:58] * elukey dances [10:49:22] * joal gives a virtual beer to elukey :) [10:50:42] * elukey drinks with joseph virtually [11:01:17] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet - https://phabricator.wikimedia.org/T268028 (10elukey) The first failover from an-coord1002 to an-coord1001 happened via DNS CNAME change, no issue raised \o/ This task is done finally! [11:02:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move oozie's hive2 actions to analytics-hive.eqiad.wmnet - https://phabricator.wikimedia.org/T268028 (10elukey) a:03elukey [11:32:20] * elukey lunch! [14:20:34] I have started https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Coordinator [14:20:38] to collect all the info [14:34:52] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Isaac) > Sorry for the radio silence! I just finished up final exams, so I'm now freed up to make more progress on this.... [14:59:42] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Yes!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/650034 (https://phabricator.wikimedia.org/T257412) (owner: 10Joal) [15:00:27] !log stopping superset server on analytics-tool1004 [15:00:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:00:59] heya teammm [15:01:23] joal, ottomata: can I deploy this today? Can it be merged? [15:01:45] I can give a final review if you wish [15:04:43] mforns: not sure what patch you're talking about :) [15:04:55] xDDDD I forgot the link!
[15:05:14] joal: this one: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646828 [15:06:56] hmmm [15:07:03] mforns let's wait til after holidays [15:07:07] no hurry [15:07:20] ok, thanks! [15:14:47] thanks ottomata - trying to juggle too many balls mforns :S [15:15:43] np! [15:22:18] ottomata: but I will deploy https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808 ok? [15:22:49] yes joal hm, FYI, I'm glad I merged ^ but perhaps that is also something that should not go out until after holidays [15:23:01] it should be fine but it does have the potential to break pageviews :/ [15:23:22] Very true ottomata [15:23:24] hm [15:23:31] whatever you guys prefer [15:24:00] I see no reason for which it should break pageviews indirectly, and if it breaks pageviews we can revert - Is that strategy ok with you mforns? [15:24:10] yes! [15:24:41] we still have this afternoon and tomorrow to revert and make sure it doesn't break [15:25:25] ok [15:25:30] proceed then! [15:27:10] (03PS1) 10Ottomata: Add Refine TransformFunction to remove canary events [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651542 (https://phabricator.wikimedia.org/T251609) [15:31:18] (03CR) 10jerkins-bot: [V: 04-1] Add Refine TransformFunction to remove canary events [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651542 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:36:04] 10Analytics, 10Analytics-Kanban: Can't use custom conda kernel in Newpyter within PySpark UDFs - https://phabricator.wikimedia.org/T269358 (10Isaac) > does https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#pyspark_and_external_packages help at all? You could certainly pass those args in a cus...
[15:36:48] ottomata, joal: are there any refinery_jar_version bump-ups needed for https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/646808 and https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/638040 [15:36:55] webrequest load bundle [15:36:59] refine? [15:39:09] 10Analytics, 10Analytics-Kanban: Can't use custom conda kernel in Newpyter within PySpark UDFs - https://phabricator.wikimedia.org/T269358 (10Ottomata) > If it's easy, I can create a task Its easy enough, and already huge (as you noticed) so no harm in adding more packages. > The zipped conda environment i... [15:40:02] For 646808 I guess so? But i don't know exactly which ones. it should be a no-op for isPageview, except possibly for some performance enhancements [15:40:18] webrequest_load I suppose yes, it uses is_pageview UDF right? [15:40:24] is there a pageviews oozie job? [15:40:41] i assume yes, but maybe it just uses the is_pageview flag on webrequest table [15:45:57] ottomata: yes, that's my understanding as well [15:46:13] ottomata: and for Refine? should I bump up in puppet? [15:46:24] no no need until we merge more stuff [15:46:27] ok ok [15:46:31] thanks! [15:53:40] hi a-team, the new superset host is live at https://superset.wikimedia.org, please check your charts and let me know if anything is weird! [15:54:03] joal / ottomata: do you know anything about your java builds? I'm doing some cleanup in CI config and trying to align what we do and have a few questions [15:54:04] k! [15:54:27] gehel: maybe a little bit, joal might know more? [15:54:34] ask away! [15:54:50] 1) do you care about email notification of releases? And about irc notifications? [15:55:21] yes i think so, email is nice, irc maybe nice too [15:56:00] 2) I remember that at some point you were specifying the version number for the releases manually, but it seems to have disappeared, was that on purpose?
[15:56:12] (yes, I could dig into the commit logs) [15:56:55] gehel: we specify them via mvn release plugin i think, which we have integrated with Jenkins [15:57:35] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source [15:59:40] Oh, I now understand, you let maven choose the next version number, but the "update-jar" job needs to have it passed manually [16:00:07] (03PS1) 10Mforns: Bump up refinery-source version for webrequest load bundle to 1.4.2 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/651548 [16:00:59] yes i think that's right. [16:01:33] ok, makes sense [16:01:47] fyi, I worked on a few patches with hashar: https://gerrit.wikimedia.org/r/q/topic:%22java-cleanup%22 [16:01:56] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/651548 (owner: 10Mforns) [16:02:35] the ones that are already merged only affect WDQS, but the next one is trying to use the same template for releasing WDQS and analytics projects [16:02:36] gehel: we still haven't fixed our shaded jar nonsense [16:02:44] (03PS2) 10Ottomata: Add Refine TransformFunction to remove canary events [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651542 (https://phabricator.wikimedia.org/T251609) [16:02:53] he, he, he... ping me if I can help you on that! [16:03:25] hehe, we know how to do it, it's just annoying; need to change that update-jars and all the references in puppet too [16:03:56] yeah, renaming artifacts is always a pain... [16:03:56] ok cool gehel that basically just aligns the release notifications bit? [16:04:08] what does jdk.net.URLClassPath.disableClassPathURLCheck=true do? [16:05:26] it's a workaround for an incompatibility between surefire and recent versions of Java 8.
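For context on "you let maven choose the next version number": maven-release-plugin's stock behavior is to bump the last numeric segment of the release version and append -SNAPSHOT for the next development iteration. A sketch of that rule, as a model of the plugin default rather than the actual Jenkins job:

```python
# Model of maven-release-plugin's default "next development version" rule:
# bump the last numeric segment and append -SNAPSHOT. This sketches the
# plugin's stock behavior, not the WMF Jenkins release job itself.
import re

def next_development_version(release_version: str) -> str:
    head, last = release_version.rsplit(".", 1)
    if not re.fullmatch(r"\d+", last):
        raise ValueError(f"unexpected version segment: {last!r}")
    return f"{head}.{int(last) + 1}-SNAPSHOT"

print(next_development_version("0.0.142"))  # 0.0.143-SNAPSHOT
```

The "update-jar" job, by contrast, gets its version passed in explicitly, which is the mismatch gehel noticed.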
I think we are using a more recent version of surefire that does not have the issue [16:05:34] (03PS1) 10Mforns: Update changelog.md for v0.0.142 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651549 [16:06:12] you might or might not still need it, we should check and clean up if we can [16:06:33] elukey: nice job with the an-coord stuff! [16:06:35] very cool! [16:06:40] here's a bad idea: [16:07:11] if we could be 100% sure that only one metastore and hive server can make writes to db [16:07:24] each metastore could always be configured to write to its local db [16:07:51] ottomata: I don't trust hive to that point but we can check if possible :D [16:08:00] and your failover process could include changing DNS and switching replication direction [16:08:25] elukey: i don't trust it either [16:08:37] :) [16:08:42] (03CR) 10Mforns: [V: 03+2 C: 03+2] "Merging for deployment train" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651549 (owner: 10Mforns) [16:10:01] ottomata: yes yes I am adding pictures so we can all see what the current status quo is and keep progressing, ideally I see in the bright future a completely automated failover with few DNS changes or even one [16:10:05] it would be super great [16:10:13] :) [16:11:38] ottomata: also I want to point out that SREs can draw nice graphs, as opposed to what fdans says about us [16:12:29] graphs? I said y'all can't draw nice giraffes! [16:13:32] fdans: hola :D [16:13:54] I am particularly proud of the selection of the metastore logo [16:14:04] * elukey coffee [16:16:48] haha [16:30:42] !log Deployed refinery-source v0.0.142 [16:40:44] Heya mforns and ottomata - I was in a meeting [16:40:51] hey joal :] [16:40:56] mforns: About jobs: indeed webrequest jar needs to be bumped [16:41:03] already did [16:41:19] I think the only job to restart is webrequest load bundle right?
[16:41:31] mforns: I'd also like for us to manually check that refine works with the maven-versions bump I did - I think it could be the one most impacted [16:41:46] mforns: correct, webrequest-load bundle [16:42:01] joal: ok, if ottomata is OK, I will bump up Refine jar version in puppet [16:42:23] mforns: About the refine check, let's see if ottomata is ok with us bumping the jar version today so that we actually know, even if more changes are planned to come [16:42:29] !log Deployed refinery-source v0.0.142 [16:42:38] mforns: you read my mind, and type faster :) [16:42:51] +1 that's fine! [16:42:52] hehe [16:43:00] ok! bumpinggggg up! [16:43:01] mforns: My apologies for not documenting the etherpad train :( [16:43:15] no no, it's fine, I check it anyway [16:45:10] stashbot: hello? [16:45:17] heh [16:51:39] bd808: I assume it's the nfs stuff upsetting it? [16:51:56] Ooh it just worked in -operations [16:52:50] It has been bouncing up and down. I can't get a stable shell to check the error logs but yes it is very likely related to the NFS switch in Toolforge. [16:54:09] The WMCS folks are working on the NFS things. It sounds like they may have some routing issues with the secondary server or the firewall rules that allow clients to talk to it. [16:54:30] (03PS3) 10Ottomata: Update junit and netty versions for github security alert [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651553 (https://phabricator.wikimedia.org/T237774) [16:55:45] bd808: aren't you part of them folk? [16:55:57] * RhinosF1 got confused as you said it like they were another team [16:57:22] RhinosF1: it's... complicated. I'm WMCS adjacent these days but not exactly on that team. [16:57:36] bd808: when did that happen? [16:57:48] * RhinosF1 has been away for about 9 weeks unwell [16:58:33] mforns: ok to merge? [16:59:24] elukey: I think so! [16:59:39] Eh. Your meta page says June. I must have just never noticed. [16:59:57] RhinosF1: about 6 months ago now.
Nicholas Skaggs took my job as manager of the WMCS team and I moved to the parent Technical Engagement team as a software engineer. I'm still meeting with the WMCS team, but I'm also meeting with 2 other teams that I'm now a part of. :) [17:00:32] gehel: https://www.mediawiki.org/wiki/Manual:Coding_conventions/Java says to use Google Java Style guide...which apparently wants 2 spaces? [17:00:34] do we do that? [17:00:36] !log Deployed refinery as part of weekly train (v0.0.142) [17:00:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:00:40] we do 4 spaces, right? [17:00:41] trying again... [17:00:46] it worked! [17:01:37] bd808: ah, I saw a new person joined. [17:01:53] If anyone finds the convo annoying, do tell me to shut up [17:02:36] mforns: milimetric standup! [17:02:36] mforns, milimetric standuuuppp [17:02:59] uooop! [17:21:55] ottomata: yooohoo [18:02:30] fdans: can you link me to the post-mortem notes, please? I will add them to the incident doc. [18:05:17] elukey or ottomata: can you merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/651559 as part of deployment train, please? [18:05:23] Thanks a lot for that mforns --^ :) [18:05:44] thank you for reminding me! [18:05:59] I had forgotten [18:07:11] too many things to think of mforns :) [18:07:16] !log restart hive server on an-coord1002 (current standby - no traffic) to pick up the new config (use the local metastore as opposed to the one analytics-hive points at) [18:07:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:08:15] mforns: running puppet now on an-launcher1002 [18:08:28] elukey: thanks a lot! [18:08:43] mforns: it is the usual 5 euros so don't thank me! :P [18:08:49] mforns: check your mail, just shared with everyone [18:09:08] elukey: xD [18:09:15] fdans: thanksss! [18:10:05] fdans: btw, thank you for taking these notes!
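For reference, the refinery_jar_version bump-ups that rode this train amount to rewriting one property in an Oozie .properties file (plus the matching puppet value for Refine). A sketch of the edit; the property name comes from the discussion above, while the surrounding file layout is an illustrative assumption:

```python
# Sketch: a refinery_jar_version bump is a one-line rewrite in an Oozie
# .properties file. The property name appears in the chat; the surrounding
# file contents here are an illustrative assumption.
import re

def bump_jar_version(properties_text: str, new_version: str) -> str:
    return re.sub(
        r"^(refinery_jar_version\s*=\s*).*$",
        lambda m: m.group(1) + new_version,
        properties_text,
        flags=re.MULTILINE,
    )

example = "name_node=hdfs://analytics-hadoop\nrefinery_jar_version=1.4.1\n"
print(bump_jar_version(example, "1.4.2"), end="")
```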
[18:13:53] !log failover analytics-hive.eqiad.wmnet to an-coord1002 (to allow maintenance on an-coord1001) [18:13:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:14:00] razzi, ottomata --^ [18:14:11] I missed a config and now I am trying to fix it :) [19:00:42] ottomata: we (Search Platform) use 4 spaces [19:00:57] But I don't think we enforce it [19:01:35] ah [19:01:45] And I don't think we should. Our code looks good enough, let's not bring additional linters that will get it wrong often enough [19:02:57] Maybe we should update the docs. We only have very few rules about style, we're definitely not enforcing Google style [19:03:39] But we have a lot of static analysis that cares about slightly deeper issues than style [19:06:52] the Product Analytics team was working on a style guide, and i was arguing for 4 spaces everywhere possible, they linked to that page and said our java conventions stated 2 spaces [19:06:55] context ^ [19:06:56] :) [19:07:15] addshore: hellllooooo [19:07:16] yt? [19:13:15] I would support you on 4 spaces [19:15:07] I also think that our Java code is looking good enough and that we shouldn't spend too much time working on a style guide [19:16:31] If there are specific things that we think should be improved, we should add rules to check them, but I'm quite opposed to starting a style guide from scratch [19:18:10] If Product Analytics is starting to write more Java code, we should make sure they use the discovery parent pom. The static analysis in there has a lot more value than having a style guide [19:21:09] If there is an ongoing discussion, I'd be happy to be part of it! [19:26:47] about https://www.mediawiki.org/wiki/Manual:Coding_conventions/Java : I find it interesting that it states: "This page describes the coding conventions used within files of the MediaWiki codebase written in Java." Is there any Java code actually in the MediaWiki codebase?
[19:32:21] ottomata: helooooo [19:35:39] !log restart hive daemons on an-coord1001 to pick up new settings [19:35:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:36:46] all right it seems that it has worked fine, I am doing the an-coord1002 -> an-coord1001 failover tomorrow morning [19:38:09] joal: ok now the hive servers are fetching from the local metastores :) [19:39:03] addshore: i am currently curious about wikidata's data storage model, and also mediawiki docker [19:39:16] you seemed to perhaps be the guy to talk to :) [19:39:22] am I wrong? [19:39:24] :D [19:40:40] ottomata: I clarified a bit https://www.mediawiki.org/wiki/Manual:Coding_conventions/Java , feel free to review and revert! [19:42:25] * elukey afk! [19:42:32] thank you gehel :) [19:57:48] (03CR) 10Joal: [C: 03+1] "\o/" (031 comment) [analytics/aqs] - 10https://gerrit.wikimedia.org/r/649884 (https://phabricator.wikimedia.org/T268809) (owner: 10Fdans) [20:00:40] ottomata: it sounds like I may be able to help :D [20:01:01] 10Analytics, 10Analytics-Kanban: Can't use custom conda kernel in Newpyter within PySpark UDFs - https://phabricator.wikimedia.org/T269358 (10Isaac) 05Open→03Resolved > Its easy enough, and already huge (as you noticed) so no harm in adding more packages. Sounds good -- in the new year, I'll likely come ba... [20:01:25] addshore: got a moment for a quick vidchat? [20:01:44] (03CR) 10Joal: [C: 03+1] "LGTM - Thanks !" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/651553 (https://phabricator.wikimedia.org/T237774) (owner: 10Ottomata) [20:01:46] I'm in another call right now and will be for the foreseeable hours D: [20:01:55] oOoOoK! [20:02:49] ottomata: I'll be interested to talk with you about your ideas on the graph-storage matters if you wish :) [20:03:42] joal: i actually know very little about how wikibase and mediawiki and data are intertwined, you got the facts on that? if so i wanna hear em!
[20:04:07] it's fairly straightforward really, wikibase makes JSON, and stores it where wikitext is stored in mediawiki [20:04:10] that is all :P [20:04:26] I have less context than addshore for sure, but I have some ideas - Let's talk and see if addshore confirms any of what I have in mind :) [20:04:29] general interaction is load Json, edit json, resave json [20:04:38] huh json. so the data model is just in the json objects? [20:04:38] I got that correct then :) [20:04:56] ottomata: batcave? [20:04:57] what do you mean by data model in that context? [20:05:00] ottomata: and if that sounds scary to you, you're not alone ;) [20:05:08] MWHAHAHA :) [20:05:15] going to bc [20:05:24] if anyone wants to join [20:05:25] that's here [20:05:25] https://meet.google.com/rxb-bjxn-nip [20:05:40] addshore: how are links between entities stored? [20:06:34] I mean, the references to what is linked to / from are stored in the json, that is indexed in elastic and blazegraph, there are also "links" in the mediawiki links table that we fill, but we don't use them for things really (outside of mediawiki) [21:04:24] Gone for tonight :) [21:57:04] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10razzi) Superset is now running on an-tool1010, so analytics-tool1004 can be decommissioned. Next up is to migrate turnilo. [22:03:02] (03PS1) 10Milimetric: [WIP] Add log-entry create schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055) [22:04:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add log-entry create schema [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/651635 (https://phabricator.wikimedia.org/T263055) (owner: 10Milimetric)
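As addshore describes above, entity-to-entity links live inside the entity JSON itself and are only then indexed into Elastic and Blazegraph. A sketch of pulling referenced item ids out of an entity document; the shape follows the public Wikibase JSON format (claims -> mainsnak -> datavalue), but the toy entity is fabricated for illustration:

```python
# Sketch: collect the item ids an entity's statements point at. The layout
# (claims -> mainsnak -> datavalue) follows the public Wikibase JSON format;
# the toy entity below is fabricated for illustration.
def linked_items(entity: dict) -> set:
    links = set()
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # novalue/somevalue snaks carry no target
            value = snak.get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                links.add(value["id"])
    return links

entity = {
    "id": "Q42",
    "claims": {
        "P31": [{"mainsnak": {
            "snaktype": "value",
            "datavalue": {"type": "wikibase-entityid",
                          "value": {"entity-type": "item", "id": "Q5"}}}}],
    },
}
print(sorted(linked_items(entity)))  # ['Q5']
```

This is roughly the traversal that the search/graph indexers have to do, since the MediaWiki links tables are not the source of truth here.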