[06:27:04] fdans: hola!
[06:27:18] just to clarify, are all the failures for pageview-daily_dump something WIP?
[06:27:49] elukey: sent a lil email but it's days for which there's no automated agent value
[06:27:57] so I need to change that prop in the job
[06:28:08] nothing to look at
[06:28:12] sorry for the spam
[06:28:32] sure sure, from the email I didn't get whether it was related, np I was just double checking :)
[06:36:32] Analytics, Analytics-Cluster, Product-Analytics: Request admin access to Superset - https://phabricator.wikimedia.org/T255207 (elukey) Given the limitations of Superset and the need for more permissions, I granted admin access to the two users, let's see how it goes. @cchen Both users are now admins...
[06:52:30] Analytics, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (elukey) Adding my 2c :) * BGP communities - if pmacct supports adding them to the Kafka JSON message directly, it should be very easy to support from the Analytics poi...
[06:55:40] Analytics, Analytics-Cluster: Upgrade schema[12]00[12] to Debian Buster - https://phabricator.wikimedia.org/T255026 (elukey) If I got it correctly the 4 schema hosts are VMs on ganeti, so we cannot upgrade them in place. We could create 4 new VMs identical but with Buster (schema[12]00[34]), apply to them...
[06:58:23] Analytics, Analytics-Cluster: Upgrade schema[12]00[12] to Debian Buster - https://phabricator.wikimedia.org/T255026 (MoritzMuehlenhoff) When you do that, please use row B/D in eqiad and row C/D in codfw to better balance out our Ganeti groups.
[07:00:16] heya team, good morning
[07:01:10] fdans: the `automated` agent-type value was added on 2020-04-29 (see https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16)
[07:01:27] elukey: I think I know why the denormalize-check has failed - will investigate and propose a fix
[07:01:57] joal: yep, had a look in turnilo, i just forgot that we had to specify the keys in the oozie job
[07:02:03] not enough coffee ;)
[07:02:29] no prob fdans - just mentioning :) Thanks for caring for the backfill :)
[07:02:43] coffee cheers to you fdans :)
[07:06:13] joal: bonjour! All right, hope that it is not another weird bug like in the past days
[07:06:35] elukey: there is in any case a weird bug - the denormalize job has run manually this month :(
[07:06:45] sorry for the bad friday morning news
[07:06:57] and cheers coffee to you as well elukey
[07:09:18] folks - I'm using a big chunk of cluster resources to speed up my investigations on mediawiki-history - please let me know of concerns
[07:10:33] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Create anaconda .deb package with stacked conda user envs - https://phabricator.wikimedia.org/T251006 (elukey) >>! In T251006#6275073, @Ottomata wrote: > @elukey we want to include some extra packages in our globally installed anac...
[07:33:00] Analytics, DBA, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (jcrespo) @Kormat do you want to do the finishing cleanup in order to close the ticket? * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems wo...
[07:37:38] * joal doesn't understand :(
[07:40:14] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: `notebook1004.eqiad.wmnet` - notebook1004.eqiad.wmnet (**PASS**)...
[07:40:59] elukey: let me know if you want to investigate the oozie issue - no big deal in waiting until next week
[07:43:36] joal: maybe in ~30 mins? currently in the middle of renaming notebook1004 to an-scheduler :)
[07:43:58] when you wish elukey - as I said, it can wait until next week, we've already been through a lot of fixing this week :)
[08:32:30] joal: I have some time now, if you want we can check.. IIUC nothing history-related is blocked since we'll proceed manually right?
[08:33:39] Analytics, Operations, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (JAllemandou) > Region/site/AS-names - I don't love the Druid lookups idea for two reasons: 1) the data would be augmented only in Druid, not in Hive, so in the future i...
[08:33:49] added a commet for you to read elukey --^ :)
[08:33:57] pfff - big fingers
[08:34:38] ah yes right the realtime indexation :(
[08:34:52] never a joy in this work
[08:35:08] elukey: actually having a streaming job would be fun!
[08:35:33] elukey: I must say I'd rather build a streaming job for this use-case than fix jar-version issues
[08:35:38] :)
[08:38:07] elukey: there are 2 problems currently with mediawiki-history
[08:38:27] elukey: the first one is that oozie still prevents the job from succeeding (a manual run of it worked)
[08:38:42] elukey: the second one is that there are errors from the checker
[08:39:05] the first one is purely technical, the second one I'm investigating to try to unlock the data-release
[08:40:40] ah ok let me know if you want to brainbounce or keep working on it, I don't want to make you context switch too many times sorry
[08:40:57] no worries elukey - all good
[08:41:12] elukey: the oozie issue is:
[08:41:13] Caused by: java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 3, local class serialVersionUID = 2
[08:41:32] So, we use the new version, but java is unhappy about it :(
[08:42:58] ok so a variation of what we were seeing yesterday, commons-lang3 but this time a different weirdness
[08:43:06] correct elukey
[08:43:32] so denormalize failed?
[08:43:41] yes - I ran it manually
[08:43:50] sigh :(
[08:44:14] it passed the failure point we were having before, but failed at the end of the job
[08:45:35] joal: so something like in refinery-spark we build with commons-lang3 version X, and we run it with version Y with spark for some reason?
[08:47:00] elukey: we build the jar using commons-lang3 version 3.5 - and when we run jobs using the spark CLI (manual or timers), it works
[08:47:19] (refine is now using new version 0.0.129 successfully)
[08:48:02] joal: yeah I know, but oozie must use something different then. Compatible enough with 3.5 to skip the timestamp problem, but different enough to yield the incompatibility?
[08:48:23] However, some spark jobs started by oozie using v0.0.129 fail
[08:48:32] actually, maybe all, as there is only one as of now
[08:48:58] which ones?
[08:49:13] elukey: mediawiki-history-denormalize :)
[08:49:40] elukey: the only restarted spark oozie job using 0.0.129 is that one
[08:49:44] all others are using older jars
[08:50:40] elukey: I'm gonna launch an oozie job using the previous jar - check if it works (and also check if the data issue we encounter is present there)
[08:50:45] https://github.com/apache/commons-lang/commit/bea1ae92aa52a985f8c171c6e17ff7fc4aa61fe4 seems to be the change that updated serialVersionUID
[08:50:50] that is 3.5
[08:51:05] so oozie is stupid and still not using 3.5?
[08:51:16] I don't know :(
[08:51:22] possible
[08:51:30] elukey: ok for me to try a new job?
[08:51:57] sure sure
[08:58:17] we may also have another problem
[08:58:26] elukey@an-coord1001:~$ ls -l /mnt/hdfs/user/oozie/share/lib
[08:58:26] total 4
[08:58:26] yes?
[08:58:26] drwxr-xr-x 13 oozie hadoop 4096 Jun 16 15:12 lib_20200616151107
[08:58:37] our dear friend dropped the old lib_ from hdfs
[08:58:56] elukey: I think this should be fine
[09:00:28] elukey: don't you think?
[09:02:15] joal: I don't recall, is there any possibility that a job still uses the old lib and may fail if it is not there?
[09:02:32] elukey: a running job - maybe, a new job I don't think so
[09:02:44] elukey: also, having restarted oozie yesterday, I think we're safe
[09:04:14] so it is transparent for jobs, oozie handles it
[09:04:15] okok
[09:04:50] I think so yes - or at least it should handle it gracefully (meaning we shouldn't have to restart it)
[09:05:12] elukey: Let's still keep an eye on error emails ;)
[09:09:01] elukey: I have triple checked the failed denorm-job config - it used the new sharelib - so it must be related to something else
[09:22:44] interesting elukey - when building refinery-job, I have this line:
[09:22:47] [INFO] Including org.apache.commons:commons-lang3:jar:3.3.2 in the shaded jar.
[09:22:55] * elukey cries in a corner
[09:23:33] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-scheduler1001.eqiad.wmnet'] ` The log can be foun...
[09:26:30] elukey: trying to understand why
[09:31:00] elukey: I moved to maven-central instead of the archiva-mirror and got 3.5 :(
[09:31:39] * joal is angry at archiva :(
[09:32:21] I checked the console logs of the build before the archiva change, and found
[09:32:24] 19:02:35 [INFO] [INFO] Building Wikimedia Analytics Refinery Jobs 0.0.127
[09:32:27] 19:03:36 [INFO] [INFO] Including org.apache.commons:commons-lang3:jar:3.5 in the shaded jar.
[09:32:30] just to double check
[09:32:44] ack elukey
[09:32:54] let's see the last one!!!
[09:33:54] in archiva we have multiple versions https://archiva.wikimedia.org/#artifact~mirror-maven-central/org.apache.commons/commons-lang3
[09:34:19] not sure why mvn picks 3.3.2
[09:34:31] I have no idea :(
[09:36:35] does mvn dependency:tree change when switching repos?
[09:37:09] I have no idea :)
[09:41:22] if you have it handy to check it would be helpful (I think)
[09:41:51] will try
[09:44:36] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-scheduler1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-scheduler1001.eqiad.wmnet'] `
[09:49:07] Analytics, DBA, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) >>! In T256966#6276329, @jcrespo wrote: > * Confirm the host is healthy, caught up and no error on log. Making sure all monitoring systems work as intended (prometheus, tendril, ...
[09:52:52] elukey: just checked the last build - in refinery-job 3.5 is included - however we now include refinery-hive in refinery-job, and in hive it is 3.3.2
[09:52:59] This might be the issue
[09:53:12] I need to drop to get Lino for lunch, will continue the investigation when back
[09:54:55] ah
[09:55:33] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-scheduler1001.eqiad.wmnet'] ` The log can be foun...
[09:59:04] Analytics, DBA, User-Kormat: dbstore1005 s8 mariadb instance crashed - https://phabricator.wikimedia.org/T256966 (Kormat) Open→Resolved @elukey confirms that analytics can query again, and i've removed the temporary grants files from my home dir.
[10:15:08] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-scheduler1001.eqiad.wmnet'] ` and were **ALL** successful.
[10:15:19] an-scheduler1001 ready!
[10:19:39] Analytics-Cluster, Analytics-Radar, Operations, ops-eqiad: Renamed notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T256397 (elukey)
[10:23:19] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (elukey) All done! The new an-scheduler1001 is currently with a generic puppet role, we'll switch to something more precise when the time comes.
[10:23:26] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Repurpose notebook100[3,4] - https://phabricator.wikimedia.org/T256363 (elukey)
[10:35:11] * elukey lunch!
[13:14:23] (PS1) Joal: Update commons-lang3 version to 3.5 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609423
[13:17:14] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review, User-Elukey: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) >>! In T234826#6275100, @jcrespo wrote: > Both servers will need server-id setup (that is why we set it up with...
[13:24:04] for my opsies: https://www.commitstrip.com/en/2020/07/03/talk-to-me/?setLocale=1
[13:40:46] :D :D
[13:40:57] speaking of ops, I see a new patch!
[13:41:07] elukey: testing
[13:42:09] joal: no version in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/609423/1/refinery-job/pom.xml ?
[13:42:33] hehe :) the magic of subprojects - no version means 'use your parent's definition'
[13:44:10] ack
[13:45:03] elukey: before my last jar bump for denormalize (move to 0.0.129), we were using 0.0.115 - and the addition of refinery-hive to refinery-job happened with release 0.0.123 - I may have a lead :)
[13:45:49] elukey: the thing I don't understand however is - how the heck does the job work when used in CLI?????
[13:49:53] no idea
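One empirical way to answer that CLI-vs-oozie question - a minimal diagnostic sketch (not something that was run in the channel): from a spark-shell, or any JVM started with the same classpath as the failing oozie launcher, print where FastDateParser was actually loaded from and which serialVersionUID the local class carries. Per the InvalidClassException above, the stream written with commons-lang3 3.5 carries serialVersionUID = 3 while the pre-3.5 class has 2.

    // Diagnostic sketch (assumption: run in a spark-shell or any JVM with the shaded
    // refinery-job jar on the classpath; names and placement are illustrative only).
    import java.io.ObjectStreamClass
    import org.apache.commons.lang3.time.FastDateParser

    val cls = classOf[FastDateParser]
    // Which jar (shaded refinery jar, oozie sharelib entry, ...) the class came from
    println(cls.getProtectionDomain.getCodeSource.getLocation)
    // serialVersionUID of the locally loaded class: 2 for pre-3.5 commons-lang3,
    // 3 from 3.5 on (matching the "stream classdesc serialVersionUID = 3" in the error)
    println(ObjectStreamClass.lookup(cls).getSerialVersionUID)

Comparing the two printed values between a CLI launch and an oozie launcher container would show whether an older commons-lang3 copy - from the sharelib or baked into the shaded jar itself - is shadowing 3.5 in one of the two launch modes.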
[15:10:41] joal: the cumin cookbook to stop the cluster worked nicely in test
[15:11:09] it enters safe mode, saves an hdfs fsimage snapshot, and shuts down all the daemons gracefully
[15:11:21] now I need to test if the upgrade works
[15:13:30] this is extremely awesome elukey :)
[15:14:20] in the bright future we might be able to upgrade to bigtop simply by watching spicerack do it
[15:15:40] the magic of automation :)
[15:16:09] it takes a ton of time but it guarantees that you don't mess up commands
[15:16:16] I'll still keep the tesla-autopilot warning in mind though: keep elukey's hands on the steering wheel :)
[15:17:16] the thing that forced me to create cookbooks was not trusting myself when doing delicate procedures, so easy to mess up
[15:17:19] :D
[15:17:41] I definitely understand - too many steps, order is important etc
[15:18:19] elukey: my tests are still running, and I have great hopes (important before the weekend :)
[15:19:04] super great joal
[15:27:23] joal: here if you need a 2nd pair of eyes to check issues with reduced
[15:30:56] Analytics, Analytics-Cluster, Project-Admins: can we rename analytics-cluster to analytics-clusters (plural)? - https://phabricator.wikimedia.org/T257059 (Nuria)
[16:00:07] Hi nuria - thanks :) Let's do that after the meeting
[16:16:08] Analytics, Event-Platform, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Workboards (Team 2), and 2 others: > 2% of API wall time spent generating UUIDs - https://phabricator.wikimedia.org/T222966 (Krinkle)
[16:41:39] !log Rerun mediawiki-history-check_denormalize-wf-2020-06 after having cleaned up wrong files and restarted a job without the deterministic skewed join
[16:41:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:42:23] joal: if you don't need me around I'd log off, but I can stay more if you want to brain bounce etc..
[16:42:54] all good elukey - I think the patch I sent about commons-lang3 has done the job (confirmation in a few hours)
[16:43:04] (CR) Nuria: Update commons-lang3 version to 3.5 (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609423 (owner: Joal)
[16:43:46] elukey: also, a run of user-history with a previous jar (0.0.115) has no data-corruption, so I'm rerunning the full stuff with that jar (the change is the deterministic skewed-join strategy)
[16:44:13] elukey: I have restarted the denormalized-check job, so once my manual job is done, data should continue flowing
[16:44:28] elukey: Have a good weekend :)
[16:44:52] joal: ya, +1, i think it will take a bit to see why the changes with skewed join "eat up" some events
[16:44:55] so we have on one side the commons-lang3 dependency weirdness, and on the other side a data corruption due to a code change?
[16:45:05] elukey: it is not corruption
[16:45:06] yessir :)
[16:45:18] deletion is more accurate
[16:45:31] elukey: some events are "missing"
[16:45:49] okok apologies, that part of the refinery source is a black box to me, I was reading and trying to summarize :)
[16:45:51] right, deletion of events in the history that should not happen thus rendering
[16:46:14] (CR) Joal: Update commons-lang3 version to 3.5 (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609423 (owner: Joal)
[16:46:19] elukey: the algorithm is outputting say 1000 events
[16:46:27] elukey: when it should output 1200
[16:46:44] joal: and about the dependency issue, do you know what happened? Just not a unified version across all refinery-* jars?
[16:46:55] I think that's it yes
[16:47:00] nuria: yep I got that part :)
[16:47:19] elukey: details - commons-lang3 3.5 was in refinery-job - all good
[16:47:34] But, in refinery-hive, commons-lang3 3.3.2
[16:47:47] joal: thanks a lot for the fix, I know it is not really fun :(
[16:48:05] And in refinery version 0.123 we imported refinery-hive into refinery-job - MESS!
[16:48:10] but probably before using the "mirrored" repo we were hiding dependency issues like this one
[16:48:33] no worries elukey - actually not related to mirrors, but to the new import of refinery-hive
[16:48:51] that was caused by?
[16:49:10] we were using jar version 115 to run denormalize, so it was good - I tried using 128 (the one before the new release) - it failed with the same error
[16:49:27] elukey: we want to use refinery-hive UDFs in spark jobs
[16:49:54] ah ok a change that we did
[16:50:01] not some side effect
[16:50:21] I mean "we imported refinery-hive in refinery-job"
[16:50:47] ah wait now I get the picture okok
[16:51:05] so we did want to import refinery-hive into refinery-job, but the two versions collided
[16:51:06] a kinda side effect, cause importing refinery-hive shouldn't have had an impact on denormalize, but yes, one of our changes was the root cause (or more precisely, that action allowed us to see that the dependencies were not aligned)
[16:51:26] yep - 2 different versions of commons-lang3 in the same shaded jar
[16:51:32] okok mine is not an attempt to blame anybody, just to have a good picture
[16:51:39] :)
[16:51:44] no problemo :)
[16:53:31] nuria: if you wanna see the change about joins: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/608567
[16:54:47] all right logging off, but if you folks need me ping me on the phone and I'll rejoin
[16:55:00] elukey: no need, i think we are good
[16:55:08] ack elukey - have a good weekend :) Thanks a lot for all the help
[16:55:22] nuria: don't say those famous words please :D
[16:55:38] elukey: I KNOW
[16:55:41] FIRE FIRE
[16:56:05] (CR) Nuria: [C: +2] Update commons-lang3 version to 3.5 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609423 (owner: Joal)
[16:56:08] Andrew is away before the weekend, it is the perfect way to invoke the curse
[16:56:19] :D
[16:56:20] o/
[16:57:47] nuria: reading the patch I found the bug :)
[16:57:54] very nasty one
[16:58:55] (CR) Joal: Make mediawiki_history skewed join deterministic (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/608567 (https://phabricator.wikimedia.org/T255548) (owner: Joal)
[16:59:00] see -^
[16:59:21] Ok, now that everything is started I'm gonna stop for dinner - Will be back after to double check everything is alright
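For context on why determinism matters in that skewed-join strategy, here is a generic salted-join sketch - purely illustrative, with assumed names (SaltBuckets, saltLarge, etc.); it is not the actual mediawiki-history code and not necessarily the exact bug found in the patch above:

    // Illustrative sketch of the classic "salted join" pattern for skewed keys,
    // NOT the refinery-source implementation.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object SaltedJoinSketch {
      val SaltBuckets = 16  // hypothetical number of salt buckets

      // Salt the skewed (large) side with a DETERMINISTIC value derived from an existing
      // column, so the same row always gets the same salt across stages and retries.
      def saltLarge(large: DataFrame, stableCol: String): DataFrame =
        large.withColumn("salt", pmod(hash(col(stableCol)), lit(SaltBuckets)))

      // Replicate the small side once per possible salt value, so every (key, salt)
      // combination produced on the large side has a row to join against.
      def replicateSmall(small: DataFrame): DataFrame =
        small.withColumn("salt", explode(array((0 until SaltBuckets).map(lit): _*)))

      // Inner join on (key, salt); if the salt values on the two sides ever diverge,
      // rows silently drop from the output - the "1000 events instead of 1200" symptom.
      def skewedJoin(large: DataFrame, small: DataFrame, key: String, stableCol: String): DataFrame =
        saltLarge(large, stableCol).join(replicateSmall(small), Seq(key, "salt"))
    }

The point of deriving the salt from a stable column rather than something like rand() is that a partially recomputed or retried stage can otherwise re-salt the same rows differently between attempts, so rows can end up double-counted or missing from the join output.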
[17:00:39] (Merged) jenkins-bot: Update commons-lang3 version to 3.5 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609423 (owner: Joal)
[17:10:55] (PS1) Joal: Fix mediawiki-history skewed join bug [analytics/refinery/source] - https://gerrit.wikimedia.org/r/609465 (https://phabricator.wikimedia.org/T255548)
[17:44:20] Analytics, Analytics-Wikistats: Wikistats New Feature - https://phabricator.wikimedia.org/T257071 (A455bcd9)
[18:38:18] Analytics, Analytics-Cluster, Project-Admins: can we rename analytics-cluster to analytics-clusters (plural)? - https://phabricator.wikimedia.org/T257059 (Aklapper) @Nuria: For future reference, please don't assign tasks to me as I might be on vacation and as there are other folks who could also perf...
[18:38:55] Analytics, Analytics-Cluster, Project-Admins: Rename Analytics-Cluster to Analytics-Clusters (plural) and make it a subproject of Analytics - https://phabricator.wikimedia.org/T257059 (Aklapper)
[18:40:44] PROBLEM - Hive Server on an-coord1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:43:34] wow
[18:43:39] this is not cool
[18:43:52] by any chance, would you be around elukey?
[18:43:56] or ottomata?
[18:44:50] ok nobody seems to be here
[18:44:55] asking in the prod chan
[18:48:06] RECOVERY - Hive Server on an-coord1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hive.service.server.HiveServer2 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive
[18:48:30] great :)
[18:51:03] !log restart virtualpageview-hourly-wf-2020-7-3-15 after hive-server failure
[18:51:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:52:10] !log restart data_quality_stats-wf-event.navigationtiming-useragent_entropy-hourly-2020-7-3-15 after hive server failure
[18:52:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:53:49] !log restart webrequest-load-wf-text-2020-7-3-17 after hive server failure
[18:53:50] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:05:41] !log kill manual execution of mediawiki-history to save an-coord1001 (too big of a spark driver)
[19:05:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:22] !log restart mediawiki-history-denormalize oozie job using the 0.0.115 refinery-job jar
[19:13:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:20:37] !log restart failed webrequest-load job webrequest-load-wf-text-2020-7-3-17 with higher thresholds - error due to a burst of requests in ulsfo
[19:20:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:25:49] Analytics, Analytics-Cluster, Project-Admins: Rename Analytics-Cluster to Analytics-Clusters (plural) and make it a subproject of Analytics - https://phabricator.wikimedia.org/T257059 (Nuria) >This will also remove tasks tagged with... yes yes