[00:53:08] (PS2) Milimetric: [WIP] Use page move events to improve joining to entity [analytics/refinery/source] - https://gerrit.wikimedia.org/r/594428 (https://phabricator.wikimedia.org/T249773)
[00:54:02] (CR) Milimetric: "still have to test, so still [WIP], but I appreciate the early catches." (3 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/594428 (https://phabricator.wikimedia.org/T249773) (owner: Milimetric)
[04:20:55] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:23:27] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:50:20] Analytics, Operations: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (Marostegui)
[04:55:30] ebernhardson: tried to change ownership with anayltics user but couldn't maybe the hdfs user is needed here (cc elukey )
[06:03:23] nuria: correct, hdfs is needed
[06:03:53] I see that Erik also owns all the subdirs
[06:05:03] RECOVERY - Check the last execution of refine_sanitize_eventlogging_analytics_delayed on an-launcher1001 is OK: OK: Status of the systemd unit refine_sanitize_eventlogging_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:05:17] !log execute hdfs dfs -chown -R analytics-search:analytics-search-users /wmf/data/discovery/search_satisfaction/daily/year=2019
[06:05:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:05:31] ebernhardson: done!
[06:07:17] Going to be afk for most of the morning, but available on the phone if needed o/
[06:52:53] Analytics: Javascript-less Wikistats - https://phabricator.wikimedia.org/T251979 (fdans)
[07:39:29] Quarry, DBA, Data-Services: Unable to use force index on replicas (Key 'PRIMARY' doesn't exist in table 'page') - https://phabricator.wikimedia.org/T251980 (RhinosF1)
[07:40:57] Quarry, DBA, Data-Services: Unable to use force index on replicas (Key 'PRIMARY' doesn't exist in table 'page') - https://phabricator.wikimedia.org/T251980 (Marostegui) >>! In T251980#6111712, @Akeron wrote: > I used https://quarry.wmflabs.org to test those queries on enwiki_p. > > It is very penali...
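For reference, a minimal sketch of the ownership change discussed between 04:55 and 06:05: the analytics user cannot chown files it does not own, so the command has to run as the hdfs superuser. The sudo wrapper below is an assumption about how that is invoked on the kerberized cluster, not a command quoted from the log.

```bash
# Run the recursive chown as the hdfs superuser; on a kerberized cluster this
# also requires valid hdfs credentials (keytab or ticket), assumed to be in place.
sudo -u hdfs hdfs dfs -chown -R analytics-search:analytics-search-users \
    /wmf/data/discovery/search_satisfaction/daily/year=2019

# Verify the new ownership afterwards.
hdfs dfs -ls /wmf/data/discovery/search_satisfaction/daily
```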
[07:54:56] Quarry, Data-Services: Unable to use force index on replicas (Key 'PRIMARY' doesn't exist in table 'page') - https://phabricator.wikimedia.org/T251980 (Marostegui)
[08:50:16] interesting JA008: File does not exist: hdfs://analytics-hadoop/user/oozie/share/lib/lib_20200204183338/hive2/libfb303-0.9.3.jar
[08:51:07] this is the pageview hourly coord
[08:52:30] elukey@stat1005:~$ ls -l /mnt/hdfs/user/oozie/share/lib/
[08:52:30] total 4
[08:52:31] drwxr-xr-x 13 99 hadoop 4096 Apr 29 07:24 lib_20200429072322
[09:01:13] that comes from
[09:01:13] /var/log/puppet.log.7.gz:224:Apr 29 07:24:24 an-coord1001 puppet-agent[169992]: (/Stage[main]/Cdh::Oozie::Server/Kerberos::Exec[oozie_sharelib_install]/Exec[oozie_sharelib_install]/returns) executed successfully
[09:04:37] !log execute oozie admin -sharelibupdate on an-coord1001
[09:04:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:05:26] !log re-run pageview-hourly coordinator 2020-5-6-6 after oozie shared lib update
[09:05:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:07:59] !log re-run data quality coordinators for 2020-5-6-5/6 after oozie shared lib update
[09:08:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:08:54] !log re-run mediarequest coordinator for 2020-5-6-7 after oozie shared lib update
[09:08:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:09:44] !log re-run mediacounts coordinator for 2020-5-6-7 after oozie shared lib update
[09:09:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:10:21] !log re-run aqs-hourly coordinator for 2020-5-6-7 after oozie shared lib update
[09:10:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:11:06] hi elukey is wdqs an analytics service?
[09:11:09] !log re-run learning features actor coordinator for 2020-5-6-7 after oozie shared lib update
[09:11:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:11:16] jbond42: nope! Search's
[09:11:20] ack thanks
[09:11:22] is it exploding?
[09:12:02] no i just noticed that wdqs-updater is failing to start in wdqs1009
[09:12:31] massive stack trace with some spark-query at the begining
[09:12:42] jbond42: ahhh okok there was a problem yesterday, that is a test host so nothing horrible, but worth to follow up with Search
[09:13:11] !log re-run apis coordinator for 2020-5-6-7 after oozie shared lib update
[09:13:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:13:47] ahh ok thanks ill ping them
[09:22:58] Analytics, Operations, Traffic, Patch-For-Review: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 (fgiunchedi) Chiming in with two cents and my Prometheus hat: I agree with @ema that none of the options are great unfortunately. Rephrasing to make sure I understand...
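A sketch of the sharelib check and refresh performed between 08:50 and 09:04, run on a host where the oozie CLI is configured (an-coord1001 here); the -shareliblist verification step is an assumption about what the installed Oozie version supports.

```bash
# The JA008 error means jobs still reference a lib_<timestamp> directory that
# no longer exists on HDFS; list what is actually there.
hdfs dfs -ls /user/oozie/share/lib

# Point the running Oozie server at the newest lib_<timestamp> directory
# (the command logged at 09:04).
oozie admin -sharelibupdate

# Optionally confirm which jars the server now resolves for the hive2 action.
oozie admin -shareliblist hive2
```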
[09:24:30] !log re-run virtualpageview coordinator for 2020-5-6-5 after oozie shared lib update
[09:24:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:25:18] !log re-run projectview coordinator for 2020-5-6-5 after oozie shared lib update
[09:25:21] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:25:50] hopefully all coords restarted
[11:20:32] Analytics, Operations, Security-Team, CAS-SSO, and 2 others: Log / alert on too many failing logins / Throttling login attempts - https://phabricator.wikimedia.org/T233944 (MoritzMuehlenhoff)
[11:22:38] Analytics, CAS-SSO, User-Elukey: Secure Hue/Superset/Turnilo with CAS (and possibly 2FA) - https://phabricator.wikimedia.org/T159584 (MoritzMuehlenhoff)
[11:22:40] back!
[11:23:25] Hi :)
[11:27:31] Wow - Thanks elukey for the restarts and all
[11:28:06] elukey: any idea how we ended up with a corruipted oozie sharelib?
[11:28:29] joal: not corrupted, the dir got re-created and the old one dropped
[11:28:42] no idea why, we had this problem before :(
[11:28:53] so oozie was freaking out
[11:30:09] hm - second time we see sharelib dropped/recreated without us being at the action button
[11:30:13] !log use /run/user as kerberos credential cache for stat1005
[11:30:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:30:33] elukey: If ok for you, I think we should investigate (one more thing :(
[11:33:07] definitely
[11:34:36] the main issue is trying to figure out who/what drops the lib
[11:34:46] because in theory it shouldn't be the oozie shlib script
[11:34:52] since it executed on the 29th
[11:39:18] I checked in the trash of hdfs, oozie, analytics and can't find it
[11:46:02] :(
[11:47:17] interestung
[11:47:18] elukey@stat1005:~$ ls -dl /mnt/hdfs/user/oozie/share/lib
[11:47:18] drwxr-xr-x 3 99 hadoop 4096 May 6 08:19 /mnt/hdfs/user/oozie/share/lib
[11:47:35] so the dir's mtime is today at 8:19 UTC, when the issue started
[11:49:37] so something has deleted the dir right before oozie freaked out
[11:50:30] elukey: I can't imagine it's not puppet
[11:50:38] hey teamm
[11:50:51] uou, alarms
[11:53:04] hey mforns
[11:53:15] joal: not sure, I can't really find any trace of it
[11:54:05] hm
[11:54:07] ah but maybe the hdfs-audit logs have it
[11:54:11] checking
[11:56:13] 2020-05-06 08:19:59,109 INFO FSNamesystem.audit: allowed=true ugi=oozie/an-coord1001.eqiad.wmnet@WIKIMEDIA (auth:KERBEROS) ip=/10.64.21.104 cmd=delete src=/user/oozie/share/lib/lib_20200204183338 dst=null perm=null proto=rpc
[11:56:17] loooool
[11:56:36] it's oozie itself!
[11:56:38] aahahaah
[11:57:10] ok wait too weird
[11:57:33] lemme check when I have executed the shlib update command just in ase
[11:57:36] *case
[11:58:49] that was around an hour later (UTC)
[11:59:43] Oozie will automatically clean up old ShareLib lib_ directories based on the following rules:
[11:59:46] After ShareLibService.temp.sharelib.retention.days days (default: 7)
[11:59:49] Will always keep the latest 2
[11:59:51] * elukey cries in a corner
[11:59:53] joal: --^
[12:00:07] 29th -> 6th
[12:00:09] one week
[12:00:13] sharp troubleshooting O.o
[12:00:42] WATTTTT !
[12:00:56] elukey: kudos for hdfs-log digging!
[12:01:17] elukey: so oozie recreates it's sharelib every week ???
[12:01:27] nono I think that this happened
[12:01:55] 1) the puppet exec triggered for some reason (like the "unless" being false due to a network glitch)
[12:02:05] 2) the new shlib gets created on the 29th
[12:02:19] 3) after a week (today) oozie decides to clean up
[12:02:45] the assumption that oozie makes is that it can safely delete stale dirs
[12:03:03] joal: --^
[12:03:35] elukey: there still is something I don't understand - on the 29th, puppet recreates an oozie sharelib - So we have 2, correct?
[12:03:45] or does it drop the previous one?
[12:04:08] the former
[12:04:35] from the hdfs-lob I assume we have 2 - the one 20200204 and the one created the 29th (20200429)
[12:04:48] after 2) yes
[12:04:52] the exec is
[12:04:55] kerberos::exec { 'oozie_sharelib_install':
[12:04:55] command => "/usr/bin/oozie-setup sharelib create -fs ${hdfs_uri} -locallib ${oozie_sharelib_archive}",
[12:04:58] unless => '/usr/bin/hdfs dfs -ls /user/oozie | grep -q /user/oozie/share',
[12:05:01] user => 'oozie',
[12:05:02] see the 'unless' ?
[12:05:05] require => [Cdh::Hadoop::Directory['/user/oozie'], File['/usr/bin/oozie-setup']]
[12:05:08] }
[12:05:21] if that fails for a network issue for example a new dir gets created
[12:05:46] right elukey - so in case of network error, or cred error (yesterday pinging today ...)
[12:05:57] We have a new dir
[12:06:12] BUT - that new dir is not the one used by oozie ?
[12:06:32] exactly, since no shared lib upgrade is executed
[12:06:58] so the old one is still used
[12:07:04] right - but oozie feels free to drop the folder that is still in use - MAAAAN that last sentence makes me feel so bad
[12:07:11] https://issues.apache.org/jira/browse/OOZIE-1783
[12:07:23] it used to be doable only during oozie startup
[12:07:32] but they thought to make it more interactive
[12:07:34] to please people
[12:07:40] so now oozie does it live
[12:08:07] doing seppuku basically
[12:08:15] xD
[12:08:38] I'm in the middle of /o\ and :D
[12:08:49] I kinda don't know where to be
[12:09:10] I get the feeling
[12:09:12] shareLib clean deletes the lib in use ????? I can't imagi
[12:09:57] I am trying to see if there is way to tell oozie "PLEASE DON'T DO ANYTHING MATE"
[12:10:46] mforns: can you please double check failed/restarted jobs?
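A rough reconstruction of the audit-log digging that identified the deleter at 11:56; the log path below is the usual location on the active NameNode and is an assumption, not quoted from the channel.

```bash
# On the active NameNode: who issued a delete under the Oozie sharelib path?
grep 'cmd=delete' /var/log/hadoop-hdfs/hdfs-audit.log | grep 'src=/user/oozie/share/lib'

# The matching entry shows ugi=oozie/an-coord1001..., i.e. the Oozie server itself
# purged lib_20200204183338 according to its sharelib retention rules.
```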
[12:10:57] joal: sure
[12:11:09] Thanks mate :)
[12:14:41] so I propose something like oozie.service.ShareLibService.temp.sharelib.retention.days=365
[12:15:01] or even maxint :D
[12:15:20] in the meantime, updating the email thread
[12:16:28] elukey: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_command-line-installation/content/set_up_oozie_configuration_files.html
[12:16:57] ahahha 1000
[12:17:13] I like it
[12:17:35] elukey: let's not forget the 'interval' prop
[12:17:59] joal: I think that even 1000 days alone would be safe :D
[12:18:05] :)
[12:18:11] I hope to not have oozie in 1000 days :D
[12:18:24] elukey: oozie will remind us 1000 days after its restart ;)
[12:28:52] !log re-run pageview-druid-hourly-coord for 2020-05-06T06:00:00 after oozie shared lib update
[12:28:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:33:00] ah snap I missed one sorry :(
[12:34:09] no problemo, joal asked me to re-check them, last one is running
[12:34:20] Thanks both of you :)
[12:36:23] code review in https://gerrit.wikimedia.org/r/#/c/594703/
[12:36:28] if you guys want to check
[12:37:13] lukin
[12:49:22] !log restart oozie on an-coord1001 to pick up the new shlib retention changes
[12:49:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:49:43] thanks mforns
[12:54:38] all right oozie should be fixed
[12:54:46] \o/
[12:54:52] Thanks again elukey
[12:55:18] yay
[12:56:59] elukey: also - Should we tell oozie to upgrade sharelib if we create a new folder?
[12:59:05] Analytics, Discovery, Wikidata, Wikidata-Query-Service: Data request for logs from SparQL interface at query.wikidata.org - https://phabricator.wikimedia.org/T143819 (dcausse)
[13:03:35] joal: not sure, let's discuss with Andrew.. ideally it shouldn't happen :)
[13:33:43] joal: whenever you have time I'd need to chat about kerberos credential cache :)
[13:34:21] when you wish elukey
[13:36:39] joal: lemme fix one thing since I am stupid
[13:36:46] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Patch-For-Review: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (mforns) This idea is probably naive and far from what we have now, but maybe: Could we have...
[13:37:15] elukey: please fix, and please stop tell me lies :)
[13:37:44] ok bc??
[13:38:19] sure elukey
[13:47:54] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Patch-For-Review: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (Ottomata) We could, but I'm not sure what that would gain us! :) We still need a way to ide...
[13:53:13] crazy find elukey. I don't even understand the purpose of that feature, are we supposed to be updating shared libs more often or something?
[13:58:30] milimetric: I have no idea :)
[13:59:34] right... like... someone *wanted* this at some point. I want to meet that person
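A sketch of how the retention change discussed between 12:14 and 12:49 can be verified on an-coord1001 once puppet has applied it; the oozie-site.xml path and the purge-interval property name are assumptions based on the Cloudera page linked at 12:16.

```bash
# Confirm the new values in the server config (standard CDH path assumed).
grep -B1 -A2 'ShareLibService' /etc/oozie/conf/oozie-site.xml
# Expected properties, per the docs linked above:
#   oozie.service.ShareLibService.temp.sharelib.retention.days  (e.g. 365 or 1000)
#   oozie.service.ShareLibService.purge.interval                (how often the purge runs)

# Restart Oozie so the running server picks the settings up (done at 12:49).
sudo systemctl restart oozie.service
```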
[14:15:02] https://github.com/openjdk/jdk/blob/master/src/java.security.jgss/share/classes/sun/security/krb5/internal/ccache/FileCredentialsCache.java#L448-L456
[14:15:05] joal: --^
[14:15:07] * elukey cries
[14:15:27] /o/
[14:15:31] (PS1) Milimetric: Use new page move incremental updates [analytics/refinery] - https://gerrit.wikimedia.org/r/594719 (https://phabricator.wikimedia.org/T249773)
[14:15:46] elukey: I guess we're gonna need to use the env var :S
[14:18:09] Analytics, Performance-Team (Radar), Vue.js (Vue.js-Search): Revise schema and performance dashboards for Vue.js search - https://phabricator.wikimedia.org/T250336 (Niedzielski)
[14:18:44] joal: that is way more invasive sigh
[14:33:12] hiyaaaa my laptop is not booting, am on an old computer with less ability to log into things atm...
[14:44:54] Analytics, Analytics-EventLogging, Analytics-Kanban, Event-Platform, Patch-For-Review: Automate ingestion and refinement into Hive of event data from Kafka - https://phabricator.wikimedia.org/T251609 (mforns) Yes, I imagined it would be easier to do it as soon as possible in the pipeline (Kaf...
[14:49:40] Analytics, Product-Analytics: Can't publish my draft dashboard on superset - https://phabricator.wikimedia.org/T248904 (mforns) I can see the dashboard mentioned in the description is published. @Esanders Is this issue solved then? Thanks :]
[14:55:08] Analytics: Check home/HDFS leftovers of anomie - https://phabricator.wikimedia.org/T250167 (mforns) @AMooney ping? :-)
[15:26:26] Analytics, Operations: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (colewhite) p:Triage→Medium
[15:30:26] Analytics, Operations: Analytics1060 unresponsive - https://phabricator.wikimedia.org/T251973 (elukey) Open→Resolved a:elukey
[15:46:25] joal: the hdfs-rsync that handles mediawiki-history-dumps is the java one? I thought it was the one you wrote...
[15:47:29] Analytics, Analytics-Kanban: hdfs-rsync of mediawiki history dumps fails due to source not present (yet) - https://phabricator.wikimedia.org/T251858 (mforns) a:JAllemandou→mforns
[15:52:15] hi! I am using the archiva ci credentials mentioned here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source#Changing_the_archiva-ci_password and I was wondering if the password has expired
[15:52:32] Analytics: Check home/HDFS leftovers of anomie - https://phabricator.wikimedia.org/T250167 (AMooney) @mforns, thanks for the ping. I am checking around for someone with access.
[16:06:24] maryum: hey! do you see a password in there?
[16:06:57] elukey: I don't have permission....I know it was there because our job was working but now it's failing so I was wondering if the password has expired like mentioned on that page
[16:07:34] ahhh ok because I can't see it too
[16:08:26] so the pw in password store (only for sres) seems not working
[16:08:37] ottomata: did you change the pass of archiva-ci recently?
[16:09:15] hmmm i might have...
[16:09:20] beacuse it expired
[16:09:53] yargh but
[16:09:59] hm
[16:10:05] i am never able to update pwstore
[16:10:20] because of expired user keys
[16:10:24] i might have given up
[16:10:40] and now, my main computer is on the fritz and is restoring from backup atm
[16:10:52] and i can't get the pw out of my pw manager until it boots... :/
[16:11:16] it would have been a couple of months ago though
[16:11:24] maryum: how long has your job been failing?
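Going back to the credential-cache discussion at 14:15: the JDK code linked there only looks at the KRB5CCNAME environment variable (falling back to its hard-coded /tmp/krb5cc_<uid> default), which is why moving the cache under /run/user needs the variable exported. The exact cache file name below is an assumption.

```bash
# Point both the MIT tools and Java clients (Spark, Hive, ...) at the per-user
# cache location used on stat1005; the file name is illustrative.
export KRB5CCNAME="FILE:/run/user/$(id -u)/krb5cc"

kinit   # obtain a ticket into the new cache location
klist   # confirm which cache path will be used
```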
[16:12:06] i think i was able to disable the archiva-ci password expiration too
[16:12:08] not sure though
[16:12:20] ottomata: it was passing on Monday and failed earlier today. we don't run it all the time
[16:12:37] I don't think I would even be able to get access to pwstore as I'm not an SRE
[16:13:01] aye
[16:13:04] yeah i haven't changed it since monday
[16:13:16] oh okay....then we have some other issue....super strange. thanks!
[16:13:30] maybe the pw did expire in archiva?
[16:13:32] it is possible
[16:14:15] I can't log into archiva either to check
[16:14:24] well not as the admin user
[16:15:26] I can as admin but I don't see if it is expired
[16:17:26] elukey: hmm okay, maybe the password is okay then. difficult to tell
[16:19:19] maryum: is it blocking you right now? (I guess yes)
[16:19:42] ottomata: one thing that I could do is to generate a new pw and then update jenkins/archiva
[16:19:52] and try to save it on pwstore
[16:20:25] elukey: it's not an immediate blocker but we plan to use this job once a week for deploys
[16:20:53] ottomata: that would be helpful, and then if the job is still failing then it must be something else
[16:21:16] elukey: +1
[16:21:31] maryum: ack I'll try to regenerated later on, is it ok?
[16:21:39] (need to run some errands sorry)
[16:21:44] elukey: yes that is fine, no rush
[16:21:49] super thanks :)
[16:21:54] * elukey errand for a while
[16:54:38] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Milimetric) @Isaac, I was finally able to run this successfully. I'm vetting the data a little bit now, basically j...
[17:00:49] nuria: i'm pretty sure ssh access is needed to use superset? elukey right?
[17:01:01] the accounts need to exist on the namenode still, no?
[17:10:29] nuria: to use presto yes
[17:10:38] since it checks credentials to access hdfs files
[17:13:43] (logging off, will check later :)
[17:13:53] err: ottomata: --^
[17:14:17] laters!
[17:15:08] Analytics, LDAP-Access-Requests, Operations, Patch-For-Review: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (colewhite) Open→Resolved ah212 added to `wmf` ldap group. Please feel free to reopen if you encounter any...
[17:32:09] Analytics: Check home/HDFS leftovers of anomie - https://phabricator.wikimedia.org/T250167 (AMooney) @tstarling, Do you have ssh access, so that you can access these files and copy them to your home dir? I'd like to ensure that we do not need them.
[17:51:11] ottomata: can you do nested string interpolation in puppet? like: $a = 'Blah' $b = "Hello, ${a}" $c = '!' $d = "${b}${c}"
[17:54:10] I guess you can, asking because maybe there's some weird double escaping that needs to be done?
[18:08:58] yes
[18:09:07] that shoul be fine
[18:09:17] it isn't working for you mforns ?
[18:09:46] no no, just checkig that there wasn't any weird thing, i.e. with backslashes or sth
[18:09:49] thx ottomata
[18:33:12] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (mforns) a:mforns
[18:40:05] mforns: sorry missed your ping and then left for diner - hdfs-rsync is the tool I wrote in scala, launched by java
[18:40:37] joal: aaah... sorry
[18:40:50] np mforns :)
[18:41:20] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594773/
[18:42:10] mforns: I like it very much the way you did it :)
[18:42:24] :D
[18:42:25] mforns: hdfs-rsync is supposed to fail when the source is missing
[18:42:26] :)
[18:43:05] ok, yea I thought to pass a new parameter to it, so that it could choose to fail or not, but if you like it in puppet, I'm happy :]
[18:43:14] Thanks a lot mforns! The puppet is also extremely good looking (I'd have gone for the hard-coded version)
[18:43:36] * mforns blushes
[20:38:59] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Isaac) > Question: is that ok? I can easily regenerate the 2020-03-02 snapshot as if it was generated by the old log...
[20:50:47] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Isaac) Quick context for snapshot ranges -- I checked via this query (I assume the spillover to April is unpredictab...
[21:06:25] Analytics, Growth-Team, Product-Analytics (Kanban): Hash edit session ID in EditAttemptStep and VisualEditorFeatureUse whitelisting - https://phabricator.wikimedia.org/T244931 (nettrom_WMF) Open→Declined After discussing this with Analytics Engineering, I think it's clear that we don't want t...
[21:22:31] Analytics, DC-Ops, Operations, ops-eqiad: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (colewhite) p:Triage→Medium
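A rough illustration of the guard idea behind the puppet change linked at 18:41 (gerrit 594773): only launch the sync once the source snapshot exists, instead of letting hdfs-rsync fail on a missing source. The snapshot path and the hdfs-rsync invocation are placeholders, not the job's real configuration.

```bash
# Hypothetical monthly snapshot directory; the real sources are defined in puppet.
SRC='/wmf/data/archive/mediawiki/history/2020-04'

# `hdfs dfs -test -d` exits non-zero while the snapshot has not been produced yet.
if hdfs dfs -test -d "$SRC"; then
    echo "source present, running the sync"
    # hdfs-rsync "$SRC" <destination>    # placeholder invocation
else
    echo "source not produced yet, skipping this run"
fi
```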