[00:49:14] PROBLEM - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [07:06:55] goood morning [07:25:39] Good morning [07:30:42] bonjour! [07:30:49] I was looking at https://github.com/apache/spark/blob/branch-2.4/conf/metrics.properties.template [07:31:12] there is a "ConsoleSink" that could be interesting to test, it logs metrics every X seconds [07:31:20] Nice :) [07:32:09] spark 3.0 supports prometheus and jmx [07:32:26] Support for prometheus is a nice addon :) [07:32:49] we are basically flying blind, I checked the last failure and the jvm seems ooming while storing on the heap [07:33:11] but it is not clear to me if it is because fetching too much data from the shuffler, or something else [07:33:58] we can try with a beefier max heap for refine, like 8G [07:34:04] for the time being I mean [07:34:20] we can also test those metrics and see if they can help or if they spam [07:34:42] hum [07:36:45] elukey: we've done two changes with ottomata in the last patch: add the logic to be able to get malformed data, and add some caching [07:36:53] elukey: we can try to remove caching [07:37:58] The thing I don't understand though is: the jobs succeed when run one at a time, and fail when run in conjunction with others - This is a signal that the problem is related to having many jobs run in sequence [07:38:09] yep I agree [07:38:56] or it could be that a single job gets into a specific corner case, that causes some executors to fail [07:39:20] elukey: it would be multi-job related nonetheless [07:40:30] ? [07:40:38] elukey: arf sorry [07:41:05] elukey: I mean the problem is related to refining multiple datasets in a single spark instance [07:42:17] the main issue is that we can only speculate every time that an issue occurs, we don't really have any proof other than stacktraces (that are not really helpful) [07:42:33] it is a sign of missing metrics and instrumentation [07:44:11] elukey: I agree we're missing instrumentation - let's try the metrics [07:45:41] going to do some tests on hadoop test to see how it goes [07:46:06] Also elukey: the data of the failed job is not even big: 508Mb [07:46:24] I feel our problem is related to too many refined datasets in parallel [07:51:57] it seems like a good lead theory.. is it the parallel task setting that Andrew tuned? [07:52:05] elukey: yes [07:52:16] joal: we can try something like 32 if you want [07:52:53] elukey: the error is seldom enough that it'll be complicated to actually monitor :( [07:53:14] elukey: I'm drilling down in logs a bit, trying to make more sense [07:54:40] we can also use a very coarse grained approach, like how many refine failures we get..
the main goal for the moment is to reduce toil [07:54:54] namely us to re-run things on a daily basis [07:55:20] yup [07:55:22] hum [08:01:11] elukey: Refining 8 dataset partitions in into tables `event`.`Test`, `event`.`HelpPanel`, `event`.`TemplateWizard`, `event`.`ContentTranslationAbuseFilter`, `event`.`HomepageModule`, `event`.`NewcomerTask`, `event`.`SpecialInvestigate`, `event`.`SearchSatisfaction` with local parallelism of 24 [08:01:37] the parallelism level set in the job has no effect - spark uses max-xpu [08:03:18] lovely [08:04:38] And, the job successfully refined 7 datasets, and failed one [08:04:47] And, all the datasets are very small [08:04:58] less than 10k records [08:07:37] elukey: I'm gonna try to re-refine on test DB and monitor [08:07:56] -- Gauges ---------------------------------------------------------------------- [08:07:59] application_1608216118485_10852.25.ExternalShuffle.shuffle-client.usedDirectMemory value = 16777216 [08:08:02] application_1608216118485_10852.25.ExternalShuffle.shuffle-client.usedHeapMemory value = 16777216 [08:08:25] Nice :) [08:09:08] it is not the best of the world since it is basically a periodic dump of values in the logs, on grafana it would be so awesome [08:09:30] elukey: there is a statsd pluggin I think [08:10:10] elukey: memory has not been changed for refine_eventlogging_legacy - 64 workers (enough) but 2G executor memory [08:10:42] we could think about having a prometheus statsd exporter on all worker nodes [08:10:50] that is polled by the prometheus masters [08:11:50] joal: ah good catch! Let's make it bigger then [08:12:12] I'm also trying to replicate the error in my own DB [08:14:47] I created https://gerrit.wikimedia.org/r/c/operations/puppet/+/655016 in the meantime [08:16:12] Works for me elukey [08:17:31] ack merging [08:18:21] !log raise default max executor heap size for Spark refine to 4G [08:18:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:19:17] 10Analytics: Kerberos principal for kharlan - https://phabricator.wikimedia.org/T271467 (10Peachey88) [08:20:56] 10Analytics, 10Beta-Cluster-Infrastructure, 10Event-Platform: [Cloud VPS alert] Puppet failure on deployment-schema-2.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T271508 (10hashar) [08:22:21] wow [08:23:11] elukey: manual test uses 48 local parallelism MEH [08:24:33] I am disabling the hdfs-cleaner timer via puppet for few days [08:24:59] Interesting - event raw data is owned as analytics/analytics making it unreadable to users [08:25:29] elukey: this means the perm change you did yesterday for data-quality will not be deleted, right? [08:25:31] what raw data? 
I mean, what paths [08:25:46] joal: also the /tmp/DataFrameToDruid dir [08:25:58] Right elukey [08:27:01] elukey: hdfs dfs -ls /wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2021/01/07/20 [08:27:34] elukey: folders are correctly owned by analytics/analytics-privatedata-users [08:27:43] but the files are owned by analytics/an [08:27:49] analytics/analytics sorr [08:27:57] This is camus related [08:28:53] Indeed, same for webrequest [08:29:17] I'm gonna make my tests run with user analytics, but this is not optimal [08:29:52] joal: you can chown some files that you need and avoid the analytics user [08:29:55] if it is a problem [08:30:13] elukey: not a real problem - I'll chown my destination folder :) [08:30:39] the main issue is that we have been working without really checking permissions for ages [08:30:44] sigh :( [08:30:58] * joal whistle slowly looking to the skies [08:44:57] !log force restart of monitor_refine_eventlogging_legacy_failure_flags.service [08:45:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:46:09] !log force restart of check_webrequest_partitions.service on an-launcher1002 [08:46:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:46:32] elukey: I found this: https://issues.apache.org/jira/browse/YARN-4714 [08:47:07] elukey: do we have virtual memory check enabled? [08:48:19] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10Hydriz) [08:48:34] in this case joal I don't think that it is Yarn that kills the spark jvms, or I didn't find traces of that in the node manager logs [08:49:39] but I don't find it explicilty set in puppet [08:51:32] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10Hydriz) [08:51:40] RECOVERY - Check the last execution of monitor_refine_eventlogging_legacy_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:53:04] RECOVERY - Check the last execution of check_webrequest_partitions on an-launcher1002 is OK: OK: Status of the systemd unit check_webrequest_partitions https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:56:52] elukey: I think it is actually - There would be no reason for the JVM to die otherwise, would it? 
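A quick way to answer the virtual-memory-check question above on a worker node, sketched under assumptions: the /etc/hadoop/conf and /var/log/hadoop-yarn paths are guesses rather than taken from this log, and YARN defaults both checks to enabled when the properties are absent.

    # is the check explicitly configured? (no match means YARN's defaults apply, which enable it)
    grep -B1 -A2 -E 'vmem-check-enabled|pmem-check-enabled' /etc/hadoop/conf/yarn-site.xml

    # if YARN killed a container for exceeding its limits, the NodeManager log says so explicitly
    sudo grep -r 'running beyond .* memory limits' /var/log/hadoop-yarn/ | tail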
[09:00:01] joal: if Xmx is set and the heap space is exhausted, and if -XX:die-on-oom (I don't recall exactly the naming) is set too then yes [09:03:10] 10Analytics: Add Spark metrics - https://phabricator.wikimedia.org/T271513 (10elukey) [09:03:57] opened --^ for the metrics [09:04:21] hm [09:04:43] Actually on my example the container got killed by yarn (logs say so) [09:05:56] and my latest test say: with local parallelism of 72 [09:06:02] * joal doesn't understand :( [09:07:37] ah ok if you see logs that yarn acted then definitely [09:08:00] I checked briefly this morning and I didn't find any trace, only the die on oom thing [09:11:05] elukey: my manual rerun of the failed hour succeeded, but with some memory error that spark recovered [09:12:48] joal: I re-ran it a while ago as well, didn't know you were working on it, hope that I didn't cause any interference :( [09:12:49] elukey: I hope that the move to 4G executor will sove our problem, but I still wonder how the thing is even possible (small data only) [09:13:10] elukey: I ran manual stuff on my own DB, no problemo [09:13:22] super okk [09:13:31] Thanks for the real rerun [09:13:57] let's add spark metrics asap and see what happens, I am confident that they'll give us some good info [09:14:09] aure elukey [09:14:15] s/a/s [09:14:33] now, the other fun and joy of the day, namely permissions :D [09:15:30] Ah I'm sorry, I can't work on that: permission denied [09:15:32] :D [09:15:35] ok - bad joke [09:15:47] * elukey cries in a corner [09:15:50] :D [09:17:21] ok - yesterday I got stuck elukey - Can we brainstorm on that? [09:18:02] sure gimme 2 mins to inject caffeine into my brain [09:51:35] aaand as always joseph was right [09:51:36] drwxr-x--- 2 hdfs hadoop 4096 Jan 8 09:50 current [09:51:38] :) [09:52:30] elukey: I've been wrong in many ways recently - I'm however glad we caught that problem before being stuck at deploy [09:54:08] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -eqsin- on alert1001 is CRITICAL: 125.5 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [09:54:20] aouch [09:55:18] of course, why don't we add joy to this week :D [09:55:58] only one host [09:57:10] seems timeouts to kafka-jumbo1001 [09:58:25] 10Analytics, 10Datasets-Archiving, 10Datasets-General-or-Unknown, 10Internet-Archive: Mediacounts dumps missing since January 4, 2021 - https://phabricator.wikimedia.org/T271511 (10JAllemandou) Thanks for reaching @Hydriz - We indeed have issues publishing data in the past days, due to T271362 as you menti... 
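On the varnishkafka delivery errors just above, a hedged sketch of the kind of check-and-restart that follows (the unit name comes from the !log entry below; the journalctl filtering is only an assumption about where the timeout/delivery errors show up):

    # on the affected cache host (cp5001 here)
    sudo journalctl -u varnishkafka-webrequest --since '1 hour ago' | grep -iE 'timed? ?out|delivery|broker' | tail -n 20
    sudo systemctl restart varnishkafka-webrequest
    sudo systemctl status varnishkafka-webrequest --no-pager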
[10:01:08] !log restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka seems not recovering very well [10:01:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:03:40] RECOVERY - cache_upload: Varnishkafka webrequest Delivery Errors per second -eqsin- on alert1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [10:19:10] tested my patch joal [10:19:11] drwxr-xr-x 14 hdfs hadoop 4096 Jan 8 10:18 current [10:19:21] \o/ [10:19:45] all my tests are failing for the moment elukey - I htink I know why, and it seems a bug [10:19:55] I am also comparing files inside a prev snapshot, it looks good [10:21:14] if you are ok I am going to merge https://gerrit.wikimedia.org/r/c/analytics/refinery/+/654833 [10:22:13] (03CR) 10Joal: [C: 03+1] "LGTM - Thanks a lot elukey" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [10:27:31] (03CR) 10Elukey: [V: 03+2 C: 03+2] "I have tested this on Hadoop test, and compared file permissions with a previous snapshot, everything looks good. Trace of execution (I de" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654833 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey) [10:44:35] (03PS4) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:45:41] (03PS5) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:46:43] (03PS6) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [10:58:53] elukey: --^ this has been tested successfully on hive - The only case not verified is the spark one for mediawiki-history-dumps [10:59:43] (03Abandoned) 10Joal: Update perms of oozie jobs writing public archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654904 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [11:02:05] joal: one qs - is hdfs dfs -dfs.permissions.umask-mode=$UMASK -mkdir -p $HDFS_DIR_ABS right? [11:02:20] a -D seems missing [11:02:24] elukey: I corrected that but probably forgot to push [11:02:34] ahhh okok super [11:02:53] (03PS7) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [11:02:58] thanks for checking elukey ! [11:04:26] will do another review in a bit! thank you! [11:04:39] elukey: in the meantime I'm testing the spark one [11:40:51] going afk for lunch ttl! [12:49:50] heyyy team! [13:02:44] morning team :) [13:19:41] (03PS1) 10Gehel: Adding Maven wrapper to conform to how other projects are built. [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 [13:28:42] (03PS2) 10Gehel: Adding Maven wrapper to conform to how other projects are built. [analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 [13:34:51] (03PS3) 10Gehel: Adding Maven wrapper to conform to how other projects are built. 
[analytics/wmde/toolkit-analyzer] - 10https://gerrit.wikimedia.org/r/655052 (https://phabricator.wikimedia.org/T271541) [14:18:09] hola mforns and fdans [14:18:15] hola hola [14:18:42] mforns: I am about to create two tasks for the things that we patched yesterday, with some summary etc.. [14:18:54] so we can figure out how to proceed [14:19:13] elukey: ok, do you want to hear about yesterday's backfillings? [14:20:28] I have a question to ask [14:20:35] mforns: nono, did everything go fine or something explode? [14:20:36] ah okok [14:20:38] (03PS1) 10Elukey: Drop Debian 9 Stretch support [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 [14:20:40] sure do you want to bc? [14:20:40] (03PS1) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 [14:20:42] (03PS1) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655069 [14:20:50] ta daaan --^ [14:20:58] elukey: no, let's do it here, it's silent friday [14:21:19] mforns: I am fine to meet, but I can understand if you don't want to see me :( [14:21:29] * elukey cries in a corner [14:21:39] elukey: xDDDD, ok, let's meet in bc! [14:21:42] omw [14:21:45] super [14:27:18] \o/ I confirm the spark dumps job works as well [14:27:48] wooooowwww [14:28:03] elukey: with your permission we can merge my patch above above ( ottomata: can you confirm it's ok for you?) [14:28:37] I need to drop again, but will deploy and restart jobs when back for real (+2h30) [14:30:37] joal: I am going to review the change before you come back! [14:30:44] <3 [14:31:00] mforns: your opnion is very much welcoime as well if you have time :) [14:31:29] ok joal will look! [14:42:34] (03CR) 10Mforns: [C: 03+1] "LGTM! I left 2 minor comments on the script file. Feel free to ignore, though." 
(032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [14:47:49] (03PS1) 10Gerrit maintenance bot: Add in-tech.wikimedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/655078 (https://phabricator.wikimedia.org/T271539) [14:50:38] (03CR) 10Joal: [V: 03+2] "Thanks @mforns for the review" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [14:51:00] (03PS8) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [15:08:54] (03CR) 10Elukey: [C: 03+1] "LGTM, I left a comment about group read vs read+execute perms, but really not a problem if everything works fine, we can change it later i" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [15:21:21] (03CR) 10Ottomata: [C: 03+1] Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [15:22:36] (03CR) 10Elukey: [C: 04-1] "Nope this was not intended in this way, sigh, too many things committed, will abandon and re-do" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 (owner: 10Elukey) [15:22:41] (03Abandoned) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655068 (owner: 10Elukey) [15:22:53] (03Abandoned) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655069 (owner: 10Elukey) [15:28:16] (03PS1) 10Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655087 [15:28:18] (03PS1) 10Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655088 [15:29:20] (03Abandoned) 10Ottomata: Bump up refinery-source version to 0.0.143 [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654476 (owner: 10Razzi) [15:29:27] (03CR) 10Ottomata: [C: 03+1] Drop Debian 9 Stretch support [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 (owner: 10Elukey) [15:29:41] (03CR) 10Ottomata: [C: 03+1] "We'll be abandoning this repo soon anyway :)" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655067 (owner: 10Elukey) [15:30:30] (03CR) 10Elukey: "I tried the create_virtual_env script on Docker and it worked fine, but this will probably need to be tested a bit in hadoop test before p" [analytics/jupyterhub/deploy] - 10https://gerrit.wikimedia.org/r/655088 (owner: 10Elukey) [15:34:50] ottomata: the privatetmp thing works on stat1004 :) [15:35:32] ottomata: what plans do you have for the jupyterhub repo? I wanted to upgrade some deps for security fixes, but if it is to be trashed I'll avoid [15:36:36] elukey: i'm testing the tmp thing on stat1006 [15:36:44] triied to run spark from a notebook after kiniting on ssh terminal [15:36:45] Exception in thread "main" java.io.IOException: Failed to create a temp directory (under /tmp) after 10 attempts! 
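The exception above is what the sandboxed notebook unit produces when it cannot create Spark temp dirs under /tmp. One rough way to confirm what a spawned unit actually got; the jupyter-<user>-singleuser name is an assumed systemdspawner default, not taken from this log:

    # transient units still answer to systemctl show; replace <user> with the notebook owner
    systemctl show 'jupyter-<user>-singleuser' -p PrivateTmp -p ReadOnlyPaths
    # after the puppet fix one would expect /tmp to be writable again for new units;
    # already-running servers still need a stop/start from the hub to pick up the new unit properties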
[15:37:00] oh [15:37:08] i think its also in the jupyter config lemme see [15:37:18] I am running puppet now on 1006 :) [15:37:33] ah it was already applied [15:37:49] # We allow users to only write to their homedirs and /dev/shm, and nowhere else [15:37:49] c.SystemdSpawner.readonly_paths = ['/'] [15:37:55] pathc oming [15:38:26] ah didn't see it thanks [15:38:33] elukey: newpyter doesn't use the jupyterhub deploy repo [15:38:48] it just runs jupyterhub out of anaconda-wmf [15:38:51] so its all in the deb package [15:39:00] ah right didn't know this bit [15:39:05] do you prefer me to abandon then? [15:39:31] no no [15:40:03] oh sorry [15:40:06] the jupyterhub one [15:40:10] sorry too many things at once [15:40:13] i dunno, y ou can remove it [15:40:15] doesn' tmatter i guess [15:40:26] no I mean the updates for the deps [15:40:36] systemd spawner, jupyterhub, etc.. [15:41:28] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/655090 [15:41:33] +1ed [15:42:17] elukey: if you want to upgrade deps for sec fixes [15:42:19] do it in anaconda-wmf [15:42:36] and we can try to get a new .deb with them included out in a week or two [15:42:38] that it is a blackhole for me, so good opportunity to learn, if you are ok yes :) [15:43:03] yaya its simple [15:43:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/anaconda-wmf/+/refs/heads/debian/README.debian.md [15:43:59] and extra deps and versions can be specified in [15:44:00] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/anaconda-wmf/+/refs/heads/debian/extra/ [15:44:07] conda-requirements if the things has a conda package [15:44:09] otherwise pip requirements [15:44:22] jupyterhub-systemdspawner [15:44:22] is in conda-requirements [15:44:31] feel free to add versions there [15:45:47] OH oops elukey i fixed the tmp thing in only the old swap jupyterhub :/ [15:50:33] ah fixing it for the other one too! SEnding cr [15:50:41] already did! [15:50:58] then you are super [15:51:05] all working now on stat1006? [15:51:14] ELUKEY IT WORKS! [15:51:14] yes [15:51:39] much better thank you [15:52:23] all right I am going to send an email to announce@, people will need to shutdown their notebook to get the new configs [15:52:32] but all the new ones should be able to work fine [15:54:43] actually, puppet is restarting jupyterhub [15:54:55] so their servers will need to be restarted i think the next time they log in [15:55:00] it will prompt them (it did me) [15:57:04] in theory no, if jupyterhub gets restarted the notebook should keep running, it is a separate systemd unit (at least this is my understanding) [15:57:30] they need to shutdown the notebook since the ephemeral systemd unit will vanish and a new one (without PrivateTmp) will be created [15:57:33] does it make sense? [15:59:08] elukey: it does, but my server was stopped after the puppet run [15:59:33] very strange [16:08:12] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Kerberos credential cache location - https://phabricator.wikimedia.org/T255262 (10elukey) >>! In T255262#6731979, @gerritbot wrote: > Change 650480 **abandoned** by Elukey: > [operations/puppet@production] profile::kerberos::client: change alterna... [16:19:18] razzi: goood morning! 
When you have time let's nuke analytics-tool1004 :D [16:19:31] 10Analytics, 10Research: Release dataset on top search engine referrers by country, device, and language - https://phabricator.wikimedia.org/T270140 (10Isaac) Thanks @Milimetric -- @bmansurov will be leading the technical work on this so we're going to start work on this and greatly appreciate whatever code re... [16:22:16] elukey: good morning, yeah, I'll decomm! [16:24:15] nice! [16:30:50] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by razzi@cumin1001 for hosts: `analytics-tool1004.eqiad.wmnet` - analytics-tool1004.eqiad.wmnet (**WARN**) - **Failed... [16:38:11] razzi: what is the error msg that you get? I am curious [16:38:36] ```Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox [16:38:36] Generating the DNS records from Netbox data. It will take a couple of minutes. [16:38:36] ----- OUTPUT of 'cd /tmp && runus...e asset tag one"' ----- [16:38:36] 2021-01-08 16:28:28,779 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox [16:38:36] 2021-01-08 16:30:44,827 [ERROR] Failed to run [16:38:36] Traceback (most recent call last): [16:38:36] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 687, in main [16:38:37] batch_status, ret_code = run_commit(args, config, tmpdir) [16:38:37] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 591, in run_commit [16:38:38] netbox.collect() [16:38:38] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect [16:38:39] self._collect_device(device, True) [16:38:39] File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 198, in _collect_device [16:38:40] if self.addresses[primary.id].dns_name: [16:39:36] mmmm weird [16:39:50] can you add this to the task? So we can ping Riccardo next week in case [16:39:58] the full stack trace I mean [16:40:21] elukey: sounds good yeah [16:40:33] it is weird since the dns record is still active [16:41:06] but https://netbox.wikimedia.org/search/?q=analytics-tool1004 is empty in netbox, probably already wiped [16:41:15] 10Analytics-Clusters, 10Patch-For-Review: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 (10razzi) Here's the error from attempting to decommission analytics-tool1004: ` Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox Generating the DNS rec... [16:41:25] I am wondering if puppet disabled played a role [16:42:22] anyway, razzi, can you execute the cookbook for netbox dns to see if it prompts a diff for analytics-tool1004? 
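A sketch of the two checks being discussed here: the dig lookup is an assumption about how one would confirm the record elukey says is still active, and the dry-run cookbook invocation mirrors the command quoted just below.

    # is the A record still there even though Netbox no longer lists the host?
    dig +short analytics-tool1004.eqiad.wmnet

    # dry-run of the Netbox DNS cookbook: shows the pending diff without committing anything
    sudo cookbook -d sre.dns.netbox "test"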
[16:43:14] yup [16:43:17] https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:43:30] you can run sudo cookbook -d sre.dns.netbox "test" [16:43:49] I see a warning in icinga but from 30 mins ago, there might be more stuff pending [16:43:59] let's use the test run to be sure [16:44:02] razzi: --^ [16:46:58] elukey: I wasn't familiar with the "test" option and kicked it off like so: `sudo cookbook sre.dns.netbox -t T268219 'Decommissioned analytics-tool1004.eqiad.wmnet'` [16:46:59] T268219: Move Superset and Turnilo to an-tool1010 - https://phabricator.wikimedia.org/T268219 [16:48:14] razzi: ah okok all good, but please wait if the diff shows more records that analytics-tool1004's [16:49:49] elukey: looks good, diff only has analytics-tool1004 [16:50:06] super, lets do it [16:57:24] 10Analytics: Follow up on /tmp/DataFrameToDruid permissions after umask change - https://phabricator.wikimedia.org/T271558 (10elukey) [16:59:41] (03CR) 10Joal: Update permissions of oozie jobs writing archives (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [17:00:45] (03PS9) 10Joal: Update permissions of oozie jobs writing archives [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) [17:03:11] 10Analytics: Follow up on /tmp/analytics permissions after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (10elukey) [17:04:20] 10Analytics, 10Analytics-Kanban, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Xaosflux) [17:07:04] 10Analytics, 10Analytics-Kanban, 10Dumps-Generation, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Bstorm) [17:16:08] 10Analytics, 10Analytics-Kanban, 10Dumps-Generation, 10Wikimedia-Portals, 10cloud-services-team (Kanban): dumps.wikimedia.org/other/pageviews/ appears to be stalled at 20210106-140000 - https://phabricator.wikimedia.org/T271561 (10Bstorm) I suspect this is related to https://lists.wikimedia.org/pipermail... [17:37:54] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10Ladsgroup) [17:45:19] heya folks [17:45:49] elukey: I know it's late - Asking permission (about persmissions) to deploy [17:45:52] 10Analytics: Follow up on Druid alarms not firing when Druid indexations were failing due to permission issues - https://phabricator.wikimedia.org/T271568 (10elukey) [17:46:12] joal: +1 :) [17:46:20] Starting the process [17:46:42] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for dpeloy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/654899 (https://phabricator.wikimedia.org/T270629) (owner: 10Joal) [17:47:32] razzi: I added subtask to https://phabricator.wikimedia.org/T270629, basically all the things that we did with Marcel yesterday [17:47:39] during the convo-debugging :) [17:49:16] elukey: nice, thanks for documenting [17:49:53] razzi: if you have questions we can follow up on meet [17:50:15] !log deploy refinery with scap [17:50:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:58:07] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy next week." 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/655078 (https://phabricator.wikimedia.org/T271539) (owner: 10Gerrit maintenance bot) [18:03:59] !log Deploy refinery onto HDFS [18:04:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:06:39] * elukey drumrolls [18:06:52] :) [18:09:28] elukey: if you're still here, I think I got why the druid ingestion failed on the 3rd, it was us stopping systemd timers in an-launcher1002 for a couple hours. When we were looking at the time interval earlier today, we didn't account for the ingestion delay of 4-5 hours... This hole of the 6th is going to be covered during the weekend by the daily ingestion job as well, so I think we're good. [18:10:24] elukey: deploy worked - perms are good :) [18:10:30] * elukey dances [18:10:43] \o/ [18:10:45] elukey: I'm now gonna restart one oozie job and check [18:10:47] mforns: ahhhh yes yes it makes sense! [18:10:56] thanks a lot for checking <3 [18:11:00] np! [18:14:31] !log Restart projectview-hourly job (permissions test) [18:14:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:15:24] 10Analytics, 10Product-Analytics: Update Image usage metric - https://phabricator.wikimedia.org/T271571 (10Isaac) [18:35:58] razzi: just fyi, explained to Luca what we discovered yesterday, and he was OK with letting the daily loading automatically backfill the holes during the week-end. So, we will do that :] By Monday, all datasources should be fine. [18:54:12] !log Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly]) [18:54:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:13:11] 10Analytics: dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster - https://phabricator.wikimedia.org/T271362 (10elukey) We are in the process of fixing permissions for files, and restart the rsync jobs that pull data from HDFS to the labstore nodes (serving dumps.wiki... [20:14:41] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) p:05Triage→03Medium a:03colewhite Prometheus-statsd-exporter sits in front of statsv and relays the statsd traffic it recieves to graphite. [[https://grafan... [20:15:42] * elukey afk! [20:17:33] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) [20:17:38] Same - gone for tonight - Have a good weekend team [20:26:00] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (033 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [20:36:34] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [21:03:43] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10Ladsgroup) oh I see. Thanks! 
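Circling back to the permissions work deployed above, a minimal sketch of the corrected mkdir form discussed at 11:02 (the -D that was missing) plus a spot check after the job restarts; the archive path is a placeholder, not taken from the log:

    # -D passes the umask as a Hadoop property for this single command
    hdfs dfs -D dfs.permissions.umask-mode="$UMASK" -mkdir -p "$HDFS_DIR_ABS"

    # spot-check group/other bits on one of the restarted jobs' archive output (placeholder path)
    hdfs dfs -ls /wmf/data/archive/<dataset>/ | head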
[21:08:14] 10Analytics, 10Performance-Team, 10observability, 10Graphite: statsv seems to be down - https://phabricator.wikimedia.org/T271567 (10colewhite) 05Open→03Resolved [21:09:55] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [21:23:53] Hi, I'm looking at https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest which mentions "wmf.webrequest" tables etc. I'm logged into Hive on stat1007 and I'm in the "default" DB. [21:24:11] "show tables;" does not list "wmf.webrequest". Do I first need to create my own DB? Or do I need some special permissions? Or are the docs wrong? I'm obviously clueless. [21:29:43] Errrm, okay that wasn't my brightest moment and questions ever when it comes to reading ^. Still: [21:29:49] hive (wmf)> SELECT * FROM webrequest LIMIT 1; [21:29:49] FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest" [21:31:40] Ah, that's https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest#Usage_notes - alright, I should read instead of ridiculing myself more here I guess. Cheers & good weekend! :) [21:48:30] (03PS18) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [21:50:48] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (035 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:02:04] (03CR) 10Milimetric: [C: 03+2] Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: 10Fdans) [22:04:01] (03Merged) 10jenkins-bot: Add Active Editors per Country metric to Wikistats [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/647792 (https://phabricator.wikimedia.org/T188859) (owner: 10Fdans) [22:05:01] (03PS12) 10Milimetric: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:05:20] (03CR) 10Milimetric: [C: 03+2] Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:06:26] (03Merged) 10jenkins-bot: Wikistats testing framework: Replace Karma with Jest [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/648376 (owner: 10Fdans) [22:12:09] (03PS10) 10Milimetric: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:12:16] (03CR) 10Milimetric: [C: 03+2] Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:13:19] (03Merged) 10jenkins-bot: Upgrade Webpack from 2 to 5 [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/649311 (https://phabricator.wikimedia.org/T188759) (owner: 10Fdans) [22:39:46] (03CR) 10Bstorm: "This latest patch set does seem to work fine locally. Naturally it might be a bit different on quarry itself. 
There is some kind of dev se" [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [23:51:11] RECOVERY - Check the last execution of analytics-dumps-fetch-pageview on labstore1006 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
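For the wmf.webrequest question from earlier (21:23-21:31): the table is partitioned and strict mode rejects queries without a partition predicate, so the partition columns need to be constrained. A minimal sketch, assuming the documented webrequest_source/year/month/day/hour partitioning and an arbitrary example hour:

    hive (wmf)> SELECT uri_host, uri_path FROM webrequest WHERE webrequest_source='text' AND year=2021 AND month=1 AND day=8 AND hour=0 LIMIT 1;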