[00:48:28] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:52:12] 10Analytics-Data-Quality, 10VisualEditor, 10WMDE-TechWish, 10Editing-team (Tracking): Investigate missing dialog close events - https://phabricator.wikimedia.org/T272020 (10matmarex) >>! In T272020#6855052, @DLynch wrote: > Medium: there might be UX confusion on mobile. If you try to swipe-back intending t... [00:55:34] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:19:14] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:23:36] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:26:05] good morning! [07:08:50] 10Analytics, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10elukey) @fkaelin this is a very interesting topic, that would need to be discussed with the SRE team first, due to the security implications of using Docker in production hosts like the stat100x boxes (that are a speci... [07:18:20] 10Analytics, 10Patch-For-Review: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (10elukey) @razzi a few suggestions: * You'd need a DNS domain first, otherwise superset-next will not point to any IP. Check what we do for the superset.wikimedia.org domai... [07:47:09] !log change gid/uid for druid + roll restart of all druid nodes [07:47:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:47:18] this is necessary since druid nodes are on buster [07:47:58] will do one by one [07:54:13] just done druid1001, simpler than expected [08:05:04] druid1002 done [08:18:16] Good morning [08:20:06] bonjour [08:20:12] druid1003 also done [08:20:21] looong morning [08:20:42] elukey: changing uuid IIUC? [08:22:42] yep [08:33:24] joal: only druid nodes are left (since they are on buster), after this I hope to start reimaging some hadoop worker nodes :) [08:34:01] the procedure (high level) will be [08:34:02] elukey: super :) [08:34:10] 1) drain the node from traffic and stop daemons [08:34:22] 2) chown to new gid/uid [08:34:25] 3) reimage [08:34:44] In theory with the current puppet setup etc.. the worker node should come up nicely [08:34:50] with the new settings [08:34:58] and eventually all nodes will have the same gid/uid [08:35:08] for mapred/druid/analytics/hdfs/yarn [08:35:18] That will be super great :) [08:38:28] ah I just realized that I missed a couple of users, analytics-search and analytics-product [08:39:54] Oh, no :( [08:40:05] and also analytics-privatedata [08:40:24] elukey: just to be consistent with myself: which nodes have you changed for UUIDs so far? 
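A minimal sketch of the chown step in the reimage procedure elukey outlines above (step 2), assuming example numeric ids and the usual Hadoop data mount point; the real ids, users and paths on each worker may differ:

    # daemons already stopped (step 1); remap files owned by the old numeric uid/gid
    old_uid=498; new_uid=906; old_gid=498; new_gid=906      # example values only
    find /var/lib/hadoop -xdev -uid "$old_uid" -exec chown -h "$new_uid" {} +
    find /var/lib/hadoop -xdev -gid "$old_gid" -exec chgrp -h "$new_gid" {} +
    # repeat per user (hdfs, yarn, mapred, druid, analytics, analytics-privatedata, ...)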
[08:40:24] it will be a problem for dirs like /var/lib/hadoop/data/b/yarn/local/usercache [08:40:33] joal: only the buster nodes [08:40:56] elukey: this means test-cluster, druid - other? [08:41:12] yes correct, like an-airflow, launcher, etc.. [08:41:21] elukey: I have no clue about hosts by OS version - I only reason in systems :) [08:41:29] ack - thanks :) [08:41:58] also elukeym about the long-running thorium task, I have funny stuff to share (not really funny actually) [08:42:18] ? [08:42:21] other problems? [08:43:19] elukey: 2 things mainly: files names are changed (slightly - removing [] fopr instance), but inconsistantly, making it difficult to check (small number of files, I can recheck manually) [08:43:44] yes this is something that I had to do due to troubles with hdfs-rync [08:44:03] but it was ezache's old files [08:44:14] More complicated: `du` gives the file-size on disk, and the files are stored in 'sparse' mode, making number not coherent with `ls` and therefore hdfs [08:45:10] elukey: I'll recheck manually (some see `][` changed to `_`, others to empty [08:45:15] I think that next time I'll just order a node with TBs of space and that's it :D [08:45:20] me and my crazy ideas [08:45:28] :D sorry for the mess elukey :S [08:45:46] nono it is expected doing cleanups, but it is frustrating [08:45:53] elukey: it is [08:46:12] elukey: has bastion2 been reimaged? [08:47:16] joal: the 1002 one, I think so yes [08:47:22] you can use bast3005 [08:47:28] yes it is :) [08:47:33] ack elukey - triple checking with fingerprints on wikitech [08:48:21] joal: qq - can we drop test_mwh_joal_2020_09 from druid analytics? [08:48:32] we can elukey [08:48:35] I'll do that [08:48:48] super thanks a lot [08:49:08] also elukey - fingerprints are not yet updated on wikitech :) [08:49:52] joal: one thing that I always forget telling you - https://wikitech.wikimedia.org/wiki/Wmf-sre-laptop [08:50:04] it contains a lot of nice things like [08:50:15] wmf-update-known-hosts-production [08:50:24] AHHH1 [08:50:32] you run it and it populates the ssh wmf known hosts [08:50:42] so you have auto-completion for all hosts while doing ssh [08:50:50] that is really nice [08:51:11] it can be used by anybody [08:51:23] (IIRC you use debian right?) [08:51:47] elukey: debian I am :) installing now [08:52:18] it may require some settings in the .config etc.. [08:55:42] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:55:44] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:56:43] uff [08:56:53] this is really annoying [08:59:20] one historical down briefly and all brokers are hanging? 
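On joal's `du` vs `ls` point above: `du` reports the blocks actually allocated on disk, while `ls -l` (like HDFS) reports the logical length, so the two disagree for sparse files; a quick way to see both (the file name is just a placeholder):

    ls -l some-backup-file.tsv                      # logical size, comparable to the HDFS size
    du -h some-backup-file.tsv                      # allocated blocks, smaller for sparse files
    du -h --apparent-size some-backup-file.tsv      # logical size again, should match ls/HDFS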
[08:59:56] same thing as when we drop datasources [09:00:38] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:01:32] !log roll restart druid brokers on druid public - locked [09:01:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:03:08] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:03:08] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:03:16] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:06:02] ok so this is really weird [09:17:00] every time I see this problem the broker p50 latency goes to 10s [09:17:31] that should be the bucket that we have in the prometheus exporter [09:18:09] but I'd expect 5s [09:18:14] that is our limit [09:23:57] ok I am retrying with druid1005, I am curious to see if it re-happens [09:26:33] joal: I have the horrible feeling that this might be related to cache [09:26:40] broker cache to be precise [09:26:43] how I am still not sure [09:33:20] Populating segment-level caches on the Broker is not recommended for large production clusters, since when the property druid.broker.cache.populateCache is set to true (and query context parameter populateCache is not set to false), results from Historicals are returned on a per segment basis, and Historicals will not be able to do any local result merging. This impairs the ability of the [09:33:26] Druid cluster to scale well. [09:33:52] /etc/druid/broker/runtime.properties:3:druid.broker.cache.populateCache=true [09:33:55] /etc/druid/broker/runtime.properties:4:druid.broker.cache.useCache=true [09:34:01] so this is likely the issue [09:34:44] we don't have a big cluster, but we have a big datasource and a number of nodes >=5 (they suggest to use these settings for clusters <5) [09:34:51] https://druid.apache.org/docs/latest/querying/caching.html#query-caching-on-brokers [09:35:53] 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Kormat) >>! In T269211#6827743, @razzi wrote: > - Merge patch to setup clouddb1021 with new role: https://gerrit.wikimedia.org/r/c/operations/puppet/+/661528 This sho... [09:36:06] and https://druid.apache.org/docs/latest/querying/caching.html#query-caching-on-historicals [09:36:54] joal: does it make sense? Maybe we should just move caching on the historicals [09:37:18] and force brokers to cache only query result [09:47:54] * elukey coffee [10:04:26] elukey: sorry I was away - reading [10:08:29] elukey: Let's try to set segment-caching on historicals - I however don't understand why it would change the problem of AQS hanging results :( [10:10:02] joal: my theory is that local caching, that is on the heap, takes a toll on brokers (evictions/LRU/etc..) 
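A sketch of the cache change proposed above (populate the segment cache on the Historicals, let the Brokers only read and merge cached results), using the standard Druid property names; whether and how this maps onto our puppet templates is an assumption:

    # historical runtime.properties
    druid.historical.cache.useCache=true
    druid.historical.cache.populateCache=true
    # broker runtime.properties
    druid.broker.cache.useCache=true
    druid.broker.cache.populateCache=false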
[10:10:31] I have been proceeding with my roll restart of the cluster, no impact so far [10:13:24] joal: another test that we could do is (if possible) to drop an old mw datasource manually or forcing the timer [10:13:35] we wipe the cache again to be sure [10:13:37] for sure elukey [10:13:39] and we proceed [10:15:01] 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Kormat) I'd recommend double-checking if https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging is still current with @Volans. There have been cha... [10:15:34] 10Analytics, 10SRE: Kubeflow on stat machines - https://phabricator.wikimedia.org/T275551 (10akosiaris) Pointing out that `access to Docker == root access on the host`. The reason for that is this very simple one liner `docker run -it --net=host --privileged debian:buster` (the provided image is unimportant, a... [10:28:51] 10Analytics-Clusters, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) We'll need to depool one clouddb* per section in order to populate the data cc @Bstorm [10:51:26] ah joal [10:51:28] /etc/druid/broker/runtime.properties:9:druid.cache.type=local [10:51:39] The local cache is deprecated in favor of the Caffeine cache, and may be removed in a future version of Druid. [10:51:48] * elukey plays sad_trombone.wav [10:52:03] hm :( [10:52:38] one change, that needs another change, that needs ... [10:53:13] nono in theory this is the default now [10:53:20] but we explictly set "local" in puppet [10:53:32] anyway, I'll also add it [10:53:36] right - we have not updated our setting to use caffeine I get that [10:54:49] yep yep what I meant is that it can be shipped separately [10:54:54] and I should probably do that [10:56:52] joal: https://gerrit.wikimedia.org/r/c/operations/puppet/+/666597 [10:57:19] from https://druid.apache.org/docs/latest/configuration/index.html#caffeine-cache I'd say to avoid changing the defaults [10:57:22] reviewing [10:57:55] (we have 8u272 as jdk) [11:08:52] !log restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache [11:08:54] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:15:55] joal: confirmed from the broker-metrics.log that caffeine is being used [11:16:01] \o/ [11:16:06] I'll have to update the config for the prometheus exporter [11:16:17] so if you are ok, I'd leave an-druid1001 for the moment with caffeine [11:16:28] and then after lunch I'll apply it everywhere [11:16:31] right - obviously the metrics are different [11:16:31] let it bake for a day [11:16:50] and then maybe apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/666598/1 [11:16:54] what do you think? [11:17:23] works for me elukey :) [11:17:41] ack thanks :) [11:17:42] elukey: shouldn't we apply caffeine everywhere before applying the historical-cache? [11:18:08] joal: yes yes sorry this is what I meant, apply caffeine on all, let it bake for a day, and tomorrow in case apply the rest [11:18:23] Yessir :) [11:18:26] perfect [11:29:28] all right going afk for lunch! [11:29:52] leaving a note in case I am not here and a rollback is needed: [11:30:08] - revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/666597 [11:30:15] - restart druid-broker on an-druid1001 [11:30:22] * elukey afk! 
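For the record, the caffeine switch plus the rollback note left above amounts to roughly this; the exact puppet handling and the service unit name are assumptions:

    # /etc/druid/broker/runtime.properties
    druid.cache.type=caffeine          # replaces the deprecated 'local' cache, size defaults untouched
    # rollback: revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/666597, re-run puppet, then
    sudo systemctl restart druid-broker    # on an-druid1001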
[11:35:17] /srv/backup/public-datasets/all/multimedia/upload-wizard-funnel-relative.tsv [11:35:20] woops [11:57:10] 10Analytics-Clusters: /wmf/data/raw should be readable by analytics-privatedata-users - https://phabricator.wikimedia.org/T275396 (10JAllemandou) For the record: I had in mind that this data not being available to `analytics-privatedata-user` group was made on prupose, as users should access the refined version... [12:14:51] Heya mgerlach - your job on stat1008 [12:15:23] mgerlach: is preventing other users to function normaly :( [12:33:54] 10Analytics-Radar, 10WMDE-Templates-FocusArea, 10MW-1.36-notes (1.36.0-wmf.29; 2021-02-02), 10Patch-For-Review, and 2 others: Adjust edit count bucketing for CodeMirror - https://phabricator.wikimedia.org/T273471 (10awight) a:03awight [12:56:08] 10Analytics-Radar: Presto error in Superest - only when grouping - https://phabricator.wikimedia.org/T270503 (10JAllemandou) > Should order in the WHERE be impactful? It is! Partition fields of the where clause are worked on their own, so their order is not of importance, but for other fields it is. Clauses are... [13:01:37] mforns: I'd like to debug some reportupdater queries, but unsure how to do this efficiently. For example, we merged the codemirror/hive/toggles query yesterday and the query seems to run fine in a shell, but the data exported to Graphite is strangely truncated. [13:08:26] (03PS1) 10Kosta Harlan: homepagevisit: Make impact_module_state optional [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) [13:09:00] (03CR) 10jerkins-bot: [V: 04-1] homepagevisit: Make impact_module_state optional [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [13:18:02] (03PS2) 10Kosta Harlan: homepagevisit: Add new state for impact module [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) [13:20:59] (03PS3) 10Kosta Harlan: homepagevisit: Add new state for impact module [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) [13:29:06] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10kostajh) hi @mforns, is there more info about why this is paused? I need to update HomepageVisit schema but the pag... [13:34:05] ls [13:34:12] wow, second time today [13:34:57] At least it's not a password :) [13:35:11] :) [13:38:32] klausman: I have seen you have data on hadoop for ATS-kafka - let me know if I can help with analysis [13:40:52] joal: sorry just saw this now [13:41:28] mgerlach: no big deal, machine is not down, but you for sure eat a lot of CPU :) [13:41:54] joal: i am checking [13:44:49] joal: will do. currently I am oscillating daily between working on k8s for Ml and working on ATS/VRN [13:45:16] And k8s in wmf puppet is... overwhelming, tbh [13:45:19] (03CR) 10Kosta Harlan: "Should we unprotected the schema page on metawiki so it can be modified to be in sync with this patch?" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [13:47:36] klausman: :S - If can help underwhelm let me know [13:47:52] Will do. 
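A short illustration of akosiaris's one-liner above and why `--privileged` is equivalent to root on the host: the container sees the host's block devices, so it can simply mount and chroot into the host filesystem (the device name below is an assumption):

    docker run -it --net=host --privileged debian:buster
    # inside the container:
    mount /dev/sda1 /mnt && chroot /mnt /bin/bash    # effectively a root shell on the host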
[13:47:54] mgerlach: CPU usage gone [13:48:08] mgerlach: not sure what you did, but it changed :) [13:49:30] joal: I am running the training-pipline for the link-recommendation model for enwiki https://github.com/wikimedia/research-mwaddlink/blob/main/run-pipeline.sh#L37 [13:49:47] I restrict to use 10 workers [13:50:29] elukey when connecting to stats machines, via a bastion, ssh informed me that "ECDSA host key for bast1002.wikimedia.org has changed" [13:50:31] I can reduce to fewer to limit CPU [13:50:52] is it safe to assume i can update signatures on my end and ignore the error? [13:51:18] gmodena: I confirm that - I asked elukey earilier on :) [13:51:31] joal awesome, thanks :D [13:51:34] mgerlach: let's wait and see how the job finishes [13:51:58] I must say, I have only used it only smaller wikis before so maybe this time it is maybe taking longer [13:52:15] mgerlach: I'm wondering about the 10 workers as the machine shows 32 cores, and 10 workers shouldn't overwhelm them [13:52:26] but thanks for pinging me to watch and pointing out problems, happy to adapt the script [13:52:32] anyhow, currently ok mgerlach :) [13:55:14] joal: grafana shows ~30% cpu-usage which is consistent with the 10 workers https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=stat1008&var-datasource=eqiad%20prometheus%2Fops&var-cluster=analytics&from=now-3h&to=now [13:56:04] indeed mgerlach - I have no clue why my htop showed quite some more usage [13:56:14] and slowness from the machine [13:56:24] thanks for checking mgerlach :) [13:58:07] joal: thanks, let me know if this appears again so I can check what might causes this, though i am trying to take steps to limit cpu-usage [14:03:06] !log roll restart druid brokers on druid analytics to pick up caffeine cache settings [14:03:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:09:07] !log roll restart druid brokers on druid public to pick up caffeine cache settings [14:09:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:17:24] 10Analytics-Clusters: /wmf/data/raw should be readable by analytics-privatedata-users - https://phabricator.wikimedia.org/T275396 (10Ottomata) Hm, I think we should encourage folks to use refined data, but the raw stuff should still be readable. It doesn't have any more privacy implications, and it will be usef... [14:18:50] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10Ottomata) @kostajh it was paused because of a PHP client bug which is now fixed, so we are unblocked. @mforns is a... [14:26:06] (03CR) 10Ottomata: "Kosta I commented in task, I'll prioritize finalizing the migration today so we can unblock this." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/666619 (https://phabricator.wikimedia.org/T270294) (owner: 10Kosta Harlan) [14:35:48] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) The druid/mapred/yarn/hdfs/analytics users have all fixed uid/gids on buster nodes. But I realized that we forgot a few, namely: * analytics-privatedata * analytics-product * analyt... 
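On gmodena's host-key warning above: the bastion was legitimately reimaged, so after checking the new fingerprint (e.g. on wikitech, as joal does above) the stale entry can be dropped and the wmf known-hosts refreshed; the second command assumes the wmf-sre-laptop tooling mentioned earlier is installed:

    ssh-keygen -R bast1002.wikimedia.org      # remove the stale key from ~/.ssh/known_hosts
    wmf-update-known-hosts-production         # regenerate the wmf production known-hosts file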
[14:49:26] 10Analytics, 10Patch-For-Review: Druid datasource drop triggers segment reshuffling by the coordinator - https://phabricator.wikimedia.org/T270173 (10elukey) Today I was stopping/starting druid daemons (one node at the time) to roll out the fixed uid/gid (unrelated task). After stopping the first historical, A... [14:49:40] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10Ottomata) [14:56:34] hi joal, I'm around now and was going to do the deploy. Some stuff got merged yesterday so I thought it best to wait [14:57:29] Hi milimetric :) [14:58:07] milimetric: I can do it if you wish and/or help and/or pair :) [14:58:37] um, nothing complicated, just your change and one from marcel, a no-op aqs testing change [14:59:10] I'll deploy and let you know if I have trouble. And then I think I've cleared my plate 100% for gobblin. Oh, joal let me know what you think of my sqoop hack [14:59:19] (https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666209 ) [14:59:29] reading [15:01:09] https://github.com/RadeonOpenCompute/ROCm/issues/1391#issuecomment-784184306 [15:01:16] not sure if you saw it yesterday --^ [15:01:21] looks very promising [15:01:30] I have seen that elukey - that's super great :) [15:08:03] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10Ottomata) [15:08:25] elukey: hellooo, do you have any idea of why superset might be importing CSVs with latin-1 encoding? [15:09:15] the only parameter I see that controls encoding is in the SQLAlchemy engine initialization, but in theory the default is utf8 [15:09:29] 10Analytics, 10Analytics-Kanban, 10Growth-Team, 10Product-Analytics, 10Patch-For-Review: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 (10Ottomata) @kostajh it looks good from here, I'm going to make a patch to the relevant extension.json files to final... [15:09:37] and I don't see anywhere in the superset source that is changing this to latin1 [15:16:32] fdans: hellooo! Importing CSV?? [15:16:49] where do you load it ? [15:16:58] elukey: mysql_staging [15:17:07] (or something similarly named) [15:17:48] fdans: we don't backup that, it is only a scratch pad, please let's make sure this is understood :) [15:18:01] there is no guarantee on staging [15:18:40] anyway no idea about the latin encoding, maybe there is a way to force it via superset? [15:18:48] elukey: yessir [15:19:16] elukey: there is, in the db params, but before trying to change it I wanted to ask you if this rung a bell [15:19:42] like, in here: https://usercontent.irccloud-cdn.com/file/0IJZrqyT/Screen%20Shot%202021-02-24%20at%209.19.35%20AM.png [15:20:15] you should be able to add "engine_params": {"encoding": "utf-8"} [15:20:34] but in theory that's the encoding that SQLAlchemy uses by default [15:21:02] yes it makes sense, you are probably the first doing it [15:21:36] (03CR) 10Joal: [WIP] Update mysql resolver to work with cloud db replicas (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/666209 (https://phabricator.wikimedia.org/T274690) (owner: 10Milimetric) [15:21:59] ah elukey, you're back :) [15:22:15] elukey: for the thorium check ,I need a new run of file-generation please [15:22:53] joal: sure, what command? 
[15:23:07] elukey: sudo find /srv/backup -type f -exec ls -l {} \;| awk '{print $2, $9}' > /home/joal/backup_files_size_ls.txt [15:23:27] joal: o/ hope you enjoyed your time off :) [15:23:29] elukey: ls gives me the size not as in blocks, while does so [15:23:39] Hi fdans - I did :) [15:27:01] awight: going to migrate your schemas on all wikis! [15:27:02] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/666659 [15:27:53] i'll wait until you are around so you can verify [15:29:39] joal: so pageview daily dump (https://hue.wikimedia.org/oozie/list_oozie_coordinator/0019114-210107075406929-oozie-oozi-C/) hasn't failed, but I should restart it with the temp table? Should I rerun everything from Feb. 9th? [15:29:48] (everything from when we did the upgrade?) [15:31:53] mforns: FYI I'm taking over the growth experiements schema migrations to unblock kosta. I hope that is ok! [15:31:55] milimetric: from 2021-02-18 IIRC [15:32:44] joal: how do you get the 18th? [15:32:53] ottomata: thank you! [15:32:56] milimetric: the impacting change was on that date [15:33:20] ah hi kostajh! so the growth schemas are all migrated on all wikis now [15:33:23] ah ok, thx! [15:33:23] see my comments in the ticket [15:33:42] there are still finalization steps, but you should be able to proceed with your schema changes if you need to [15:33:59] you'll just have to make the schema uri config changes in two places until the extension patch is merged and deployed [15:34:41] ottomata: Excellent, please go ahead. [15:34:49] oh awesome hello! [15:34:50] ok [15:35:17] awight: would you make patches to finalize these changes in relevant extension.json files? [15:35:19] joal: done! [15:35:29] like this [15:35:29] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/666651 [15:35:41] ack elukey - checking [15:35:45] ottomata: I was just going to ask. Yes, no particular timing constraint, right? [15:35:49] nope [15:36:05] we just have to maintain the config overrides in mw-config until those are deployed [15:36:32] Great, I'll prepare the patches now and merge within a week... [15:37:02] yeehaw [15:37:20] mforns: the table name in the script doesn't match the job: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/659306/4/hive/data_quality_stats/create_traffic_anomaly_checked_countries_table.hql [15:37:42] what shall I do? use the name the job expects and wait for the next deploy seems the easiest [15:38:26] milimetric: if you give me 5 minutes I'll send a patch [15:39:06] yeah, I can patch it but it's just another deploy... 
um, it's ok I got it [15:39:17] joal: no worries I'll take care of it the right way [15:39:24] milimetric: I can help :) [15:39:30] 10Analytics, 10Analytics-EventLogging, 10Community-Tech, 10Event-Platform, and 2 others: CodeMirrorUsage Event Platform Migration - https://phabricator.wikimedia.org/T275005 (10Ottomata) [15:39:34] 10Analytics, 10Event-Platform, 10WMDE-TechWish: ReferencePreviewsBaseline Event Platform Migration - https://phabricator.wikimedia.org/T275007 (10Ottomata) [15:39:40] nah it's brainless joal, save your brain [15:39:41] 10Analytics, 10Event-Platform, 10WMDE-TechWish: ReferencePreviewsCite Event Platform Migration - https://phabricator.wikimedia.org/T275008 (10Ottomata) [15:39:47] 10Analytics, 10Event-Platform, 10WMDE-TechWish: ReferencePreviewsPopups Event Platform Migration - https://phabricator.wikimedia.org/T275009 (10Ottomata) [15:39:56] 10Analytics, 10Event-Platform, 10WMDE-TechWish, 10Patch-For-Review: TemplateDataApi Event Platform Migration - https://phabricator.wikimedia.org/T275011 (10Ottomata) [15:39:58] ok milimetric - thanks [15:40:02] 10Analytics, 10Event-Platform, 10WMDE-TechWish: TemplateDataEditor Event Platform Migration - https://phabricator.wikimedia.org/T275012 (10Ottomata) [15:40:16] 10Analytics, 10Event-Platform, 10WMDE-TechWish, 10Patch-For-Review: TwoColConflictExit Event Platform Migration - https://phabricator.wikimedia.org/T275014 (10Ottomata) [15:40:21] 10Analytics, 10Event-Platform, 10WMDE-TechWish: TwoColConflictConflict Event Platform Migration - https://phabricator.wikimedia.org/T275013 (10Ottomata) [15:40:24] 10Analytics, 10Event-Platform, 10WMDE-TechWish: VisualEditorTemplateDialogUse Event Platform Migration - https://phabricator.wikimedia.org/T275015 (10Ottomata) [15:40:42] awight: oo also please verify that everything still looks ok from all wikis [15:41:42] (03PS1) 10Milimetric: Fix table reference [analytics/refinery] - 10https://gerrit.wikimedia.org/r/666665 [15:42:01] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Fix table reference [analytics/refinery] - 10https://gerrit.wikimedia.org/r/666665 (owner: 10Milimetric) [15:42:14] elukey: I'm super sorry I'm dumb :( [15:42:22] elukey: my command was incorrect :S [15:43:00] joal: how dare you joseph! :D [15:43:11] if you are dumb what am I? Brainless? :D [15:43:20] ottomata: Sure thing, let me know when the changes are expected to take effect. [15:44:13] awight: they should be out, it migth take a little while for resourceloader JS caches to expire [15:44:18] the PHP ones should be 100% now [15:44:30] kk [15:45:10] hi milimetric just saw your message [15:45:21] np mforns, I got it [15:45:33] sorry for the mistake [15:45:37] I was just trying to be lazy, jo kept me honest [15:45:50] oh it's a tiny one, no apologizing for tiny mistakes :) [15:46:35] I saw the review for that anyway, when you asked me to take a second look, so it's technically my mistake [15:46:43] (but I'm not apologizing for it :)) [15:47:11] hehehe [15:47:18] what was the problem, didn't understand [15:47:29] can not see it in the CR [15:48:52] oh! I see now... [15:50:04] makes sense, thank you for fixing! [15:50:17] 10Analytics, 10SRE, 10ops-eqiad: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (10elukey) Started `sudo smartctl -t long /dev/sdl -d megaraid,11,` [15:51:09] you know, this scap deploy really wastes a ton of time. 
[15:51:28] we should have some way to deploy just refinery job changes separate from jars [15:51:52] elukey: sudo find /srv/backup -type f -exec ls -l {} \;| awk '{print $5, $9}' > /home/joal/backup_files_size_ls.txt [15:51:52] omg... look at those two lines... the spaces line up so well! [15:51:56] * milimetric retires, I'm so happy [15:52:29] milimetric: agreed for jars dpeloy [15:52:42] * joal runs after milimetric to keep him [15:53:53] dude, I'm at the top of my game. That's 5!!! alignments by accident. I just don't see how it gets better than that [15:53:53] https://xkcd.com/276/ [15:57:52] xD [16:01:23] milimetric: that is very possible to do [16:01:42] just make a new scap environment that uses the same targets file as the main one [16:01:48] but sets git_binary_manager: None [16:01:51] like thin does [16:02:01] then you'd just have to do [16:02:15] scap deploy -e no-artifacts [16:02:16] or something [16:03:00] !log deployed refinery [16:03:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:03:15] hmm actually no [16:03:16] that wouldn't work [16:03:21] ottomata: I asked if that exact thing was possible a year or so ago, and you said it wasn't :) [16:03:25] ottomata: I understood that point was to separate artifacts deploy from oozie [16:03:26] haha [16:03:34] because then the newly scap deployed dir woudln't have any artifacts! [16:03:41] what we want is a way to make it use the preivous deploys artifacts [16:03:42] rats [16:03:44] ottomata: this means 2 folders, and sync deploy when needed [16:03:59] correct ottomata [16:04:03] yeah, I remember that was the problem [16:04:04] ya true joal, we could remvoe the artifacts from refinery altogether and make another repo that just uses git fat [16:04:13] and symlink it into refinery [16:04:15] dir [16:04:33] milimetric: hacky way maybe [16:04:34] could we scap deploy refinery-source to do that? [16:04:41] ssh into target host [16:04:49] and just do a scap pull (?) or something not sure the right command [16:04:59] maybe even just a git pull would work [16:05:13] milimetric: hm we could but we'd have to add the packaged jars to the refinery-source repo then [16:05:19] nah, we should do this the right way. It just wastes like 10-15 minutes pretty much every week, so it's worth ~ 4-5 hours of work [16:05:24] and, not all artifact jars are fromo refinery-source [16:05:26] ottomata, milimetric: one concern I see with the approach is having to reference path for jars that are not 'relative' [16:05:31] but that's minor [16:05:34] ? [16:05:40] joal: we could symlink it into refinery [16:05:41] so [16:05:48] 10Analytics-Clusters, 10DBA, 10Patch-For-Review: Convert labsdb1012 from multi-source to multi-instance - https://phabricator.wikimedia.org/T269211 (10Marostegui) [16:06:02] /srv/deployment/analytics/refinery/artifacts -> /srv/deployment/analytics/refinery-artifacts [16:06:04] ottomata: I'm not sure how simlynks would work with deploys and hdfs - but maybe! [16:06:17] hmm hdfs probnably doesn't matter [16:06:24] we'd probably have to do it with puppet though [16:06:29] not have the symlink in git [16:06:40] hmmm but then teh link would disappera after each deploy [16:06:44] anyway yeah ok this is messy [16:06:46] probably la way [16:06:59] actually, probably the right way wwoudl lbe to use maven for deployment of jars? [16:08:37] joal: got a sec for a java q? [16:14:52] ottomata: sure! [16:15:01] ottomata: batcave? [16:15:05] ok! 
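A rough sketch of the two ideas floated above for skipping jar syncs on refinery deploys; the first was already noted to fall short on its own (a fresh deploy dir would have no artifacts), and the environment file layout is an assumption:

    # idea 1: a scap environment that skips git-fat, e.g. scap/environments/no-artifacts/scap.cfg
    [global]
    git_binary_manager: None
    # used as: scap deploy -e no-artifacts "jobs-only deploy"

    # idea 2: move artifacts to their own deployed repo and symlink it in (puppet-managed)
    ln -s /srv/deployment/analytics/refinery-artifacts /srv/deployment/analytics/refinery/artifacts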
[16:22:50] ottomata: Verified eventgate traffic for all topics :-) [16:29:39] (03PS3) 10Neil P. Quinn-WMF: Fix inconsistent Hive query fail [analytics/wmf-product/jobs] - 10https://gerrit.wikimedia.org/r/665406 (owner: 10Milimetric) [16:36:50] * elukey afk! bbiab [16:37:34] awight: wooOHoOo! [16:41:18] gehel: yt? i have a spotbugs error in wikimedia-event-utilities i dont' understand (from discovery pom), and I think I should disable the check but im' not sure [16:41:39] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/659227 (https://phabricator.wikimedia.org/T272569) (owner: 10Andrew-WMDE) [16:43:54] (03PS1) 10Awight: Fix bad table name in query [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/666685 (https://phabricator.wikimedia.org/T272569) [16:45:11] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/666685 (https://phabricator.wikimedia.org/T272569) (owner: 10Awight) [16:48:37] elukey: not yet correct :( [16:48:47] * joal hides in the corner :( [16:50:41] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/657635 (https://phabricator.wikimedia.org/T273474) (owner: 10Awight) [16:58:08] elukey: sudo find /srv/backup -type f -exec ls -l {} \; > /home/joal/backup_files_size_ls.txt [16:58:22] elukey: I'm super sorry :( [16:59:17] joal: ncdu is not an option? [17:00:13] klausman: I'm after the list of file, and their size, but not on disk, the 'user' one [17:00:50] a-team: sorry I'll be a little late for standup, I basically just deployed since yesterday and stared at the gobblin job I'm writing, planning on doing more of the latter going forward [17:00:57] klausman: reason for the various runs are: sparse-files (change from du to ls), awk-error, then awk split issue [17:09:05] ottomata: got a link to the code ? And what's the error ? [17:09:50] * gehel is cooking dinner, alone with 2 kids. Expect some lag ! [17:10:52] gehel: [17:10:55] https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-java8-docker/85/console [17:10:58] 12:02:31 [ERROR] Medium: Class org.wikimedia.eventutilities.core.json.JsonLoader has a circular dependency with other classes [org.wikimedia.eventutilities.core.json.JsonLoader] At JsonLoader.java:[lines 25-200] FCCD_FIND_CLASS_CIRCULAR_DEPENDENCY [17:11:06] https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/665415/11/eventutilities/src/main/java/org/wikimedia/eventutilities/core/json/JsonLoader.java [17:11:29] at first I thought it was because of the singleton JsonLoader instnace [17:11:44] since it mentions line 25 [17:11:50] but that doesn't seem right [17:11:51] That sounds like something you should not ignore [17:12:41] That's going to require more reading than what I can do on my phone. I'll have a look later tonight [17:12:57] oh oh HttpRequest uses JsonLoader ok i see the circle [17:13:55] meh i'll just avoid using JsonLoader in HttpRequest [17:17:59] joal: file created on thorium [17:18:03] ack elukey [17:20:35] ottomata: looks like you don't need me after all ! [17:22:37] yeah thank you i thoughht it was the singleton [17:22:43] thank you! [17:22:49] sorry to bother have a good eve! [17:24:06] Have fun ! [17:43:52] elukey: do you have a minute? 
[17:44:04] elukey: I'm about to finalize thorium [17:45:27] joal: in a meeting, will be free in ~45 mins [17:45:39] ok elukey [18:01:13] joal: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/665415 is finally passing and ready for review [18:22:48] joal: I am free now [18:23:11] heya elukey - I have 2 diffs between thorium and backup [18:23:53] elukey: one is file name: hdfs://analytics-hadoop/wmf/data/archive/backup/misc/thorium/public-datasets/all/mwrefs/mwcites-20180301/viwiki.tsv.tar.gz._COPYING_ [18:24:19] the file size is the same on both HDFS and thorium despite the name being _COPYING_ [18:24:30] I think we can manualy rename it [18:25:22] And, /srv/backup/backup_wikistats_1/htdocs/FR/PlotDatabaseEdits1.svg has different sizes (997774 on thorium and 786432 on HDFS) [18:25:28] elukey: --^ [18:25:51] elukey: I'm heading to interview in 5 minutes [18:26:07] elukey: let's nail this down tomorrow morning? [18:26:46] sure! [18:26:53] thanks a lot for the analysis [18:27:12] np elukey [18:30:38] 10Analytics-Clusters, 10Patch-For-Review: Install Debian Buster on Hadoop - https://phabricator.wikimedia.org/T231067 (10elukey) https://gerrit.wikimedia.org/r/666657 needs some follow up on the following nodes first: ` elukey@cumin1001:~$ sudo cumin 'P{c:profile::analytics::cluster::users} and P{F:lsbdistcod... [18:50:15] (03CR) 10Mforns: "Hi Erin!" (037 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/666223 (owner: 10Erin Yener) [19:01:03] * elukey afk! [19:05:26] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (10mpopov) [19:09:03] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:11:22] hum hum humn [19:16:09] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:16:10] !log service hadoop-yarn-nodemanager start on an-worker1112 [19:16:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:16:59] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (10mpopov) [19:18:13] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure: MEP: Schema fragments shouldn't require fields - https://phabricator.wikimedia.org/T275674 (10mpopov) [19:45:03] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10Ottomata) Marcel and I discussed and came up with a plan: - Rename WhitelistSanitization because T254646 - Rename EventLoggingSanitization job class and make i... [19:52:01] ottomata: I'm gonna sign off for tonight- let's talk about your patch tomorrow afternoon if you have a minute [19:55:09] 10Analytics, 10Event-Platform: Sanitize and ingest event tables defined in the event_sanitized database - https://phabricator.wikimedia.org/T273789 (10mforns) Would it be scope creep to add a way for Refine to not traverse part of a directory tree? This way we can have 2 sanitization jobs that go over the even... 
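For the two thorium/HDFS diffs listed above, a possible manual verification sketch before the rename (the local path mirrors the HDFS layout and is an assumption):

    # compare logical byte counts of the _COPYING_ file
    hdfs dfs -stat %b /wmf/data/archive/backup/misc/thorium/public-datasets/all/mwrefs/mwcites-20180301/viwiki.tsv.tar.gz._COPYING_
    stat -c %s /srv/backup/public-datasets/all/mwrefs/mwcites-20180301/viwiki.tsv.tar.gz
    # if they match, drop the leftover suffix
    hdfs dfs -mv \
      /wmf/data/archive/backup/misc/thorium/public-datasets/all/mwrefs/mwcites-20180301/viwiki.tsv.tar.gz._COPYING_ \
      /wmf/data/archive/backup/misc/thorium/public-datasets/all/mwrefs/mwcites-20180301/viwiki.tsv.tar.gz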
[19:55:50] joal: sure! [20:02:51] * razzi afk for lunch [20:22:15] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:24:35] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:55:06] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure: Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (10mpopov) [20:55:57] 10Analytics, 10Event-Platform, 10Product-Analytics, 10Product-Data-Infrastructure: Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (10mpopov) [20:57:28] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure, 10Product-Analytics (Kanban): Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (10mpopov) [20:57:51] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure, 10Product-Analytics (Kanban): Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (10mpopov) p:05Triage→03Medium [22:18:52] 10Analytics, 10Event-Platform, 10Product-Data-Infrastructure, 10Product-Analytics (Kanban): Draft of full process for instrumentation using new client libraries - https://phabricator.wikimedia.org/T275694 (10mpopov) [22:48:28] 10Analytics: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (10Ottomata) [23:03:33] 10Analytics: Newpytyer python spark kernels - https://phabricator.wikimedia.org/T272313 (10Ottomata) I'd really like to make [[ https://www.irccloud.com/pastebin/dQVrXPSq/ | Fabian's script ]] to auto pack and ship conda envs into yarn something we can use easily. It should work from the CLI as well as in Pytho...
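On the conda-env packing idea in the last Phab note above: a generic sketch of what shipping a packed env to YARN typically looks like (this is not Fabian's script; the env name, job file and spark2-submit binary are assumptions):

    conda pack -n my-env -o my-env.tar.gz
    PYSPARK_PYTHON=./env/bin/python \
    spark2-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/bin/python \
      --archives my-env.tar.gz#env \
      my_job.py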