[02:46:04] (CR) Conniecc1: "> Patch Set 2: Code-Review-1" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: Conniecc1)
[05:48:18] goood morning
[05:48:49] !log decom analytics1049 from the Hadoop cluster
[05:48:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:06:33] restarting from yesterday - it is interesting to compare an-worker1085 (12 disks worker) vs an-worker1096 (22 disks worker with GPU etc..)
[06:07:06] the 1085 worker suffers for temporary disk saturation (even minutes):
[06:07:09] https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&var-site=eqiad&var-cluster=analytics&var-instance=an-worker1085&var-datasource=thanos&from=1602612782505&to=1602613775886
[06:09:26] the super heavy bursts are related, afaics, from high write IOPS (~700/s per disk)
[06:09:56] but sometimes even less
[06:11:42] meanwhile 1096 seems to saturate as well sometimes (although the worker is new so it has surely less load)
[06:11:57] but at a lower IOPS level, ~400/500
[06:12:08] see for example https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&var-site=eqiad&var-cluster=analytics&var-instance=an-worker1096&var-datasource=thanos&from=1602457333591&to=1602468970942
[06:14:19] the disks are different though, all HDDs but the GPU workers have the 2.5 ones (smaller, to fit the chassis)
[06:15:01] I was kinda assuming that more disks would have helped in spreading the load
[06:16:13] anyway, not a big deal, but maybe before ordering new workers we could have checked disk metrics, didn't even think about it
[06:19:11] now the other thing to keep in mind, now that I think about it, is that we do leverage the dell hw raid controller for all those disks
[06:19:30] we configure them as single disk raid0, getting write/read cache etc..
[06:19:53] (there is also a BBU battery as backup, and we fallback to writethrough in case it fails)
[06:20:37] I am not sure how/if all these nodes are configurable to just use jbod and no hw raid config, but it could be interesting to measure perfs
[06:31:34] (I am checking settings for all disks, I have them in a cookbook)
[06:53:53] added comments in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/633941
[06:55:30] ok at this point I'd be interested to test WriteThrough policy
[06:59:54] maybe our raid write cache is not great (I wouldn't be surprised) so it acknowledges fast write IOPS ending up in saturating the cache, and slowing down everything
[07:17:40] Analytics: Check home/HDFS leftovers of joewalsh - https://phabricator.wikimedia.org/T265447 (MoritzMuehlenhoff)
[08:44:28] (PS1) Elukey: Add an-test-coord1001 to the scap list for hadoop test [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/633953
[08:44:48] (CR) Elukey: [V: +2 C: +2] Add an-test-coord1001 to the scap list for hadoop test [analytics/refinery/scap] - https://gerrit.wikimedia.org/r/633953 (owner: Elukey)
[09:37:37] * elukey afk for a bit
[09:45:32] elukey: can the raid mode of the machines be switched without having to wipe the RAIDs themselves? I remember some controllers would basically invalidate all data if they were turned into dumb disk controllers.
[09:48:22] Oh and morning everyone :)
[10:18:14] klausman: morning! We can definitely switch writeback with writethrough without risking anything, but I am not sure if DELL's hw raid controllers allow to use plain jbod
[10:19:44] In my experience it is rare, though definitely not impossible for caches to make congestion/blocking worse rather than better. If we can simply turn off the cache with no more disruption than maybe a reboot, we should definitely try it. Do we have some sort of benchmark or comparable workload? Also, what about HPE machines?
[10:20:55] we don't have HPE hosts for analytics, except a few here and there (not in hadoop)
[10:21:45] Right. I just remember at least one of the stat100x machines having an iLO :)
[10:22:10] I mean it is not a pressing issue but there might be something that we can do about it:)
[10:22:19] Ack.
[10:23:36] the raid controllers that we have are not fancy ones, but I agree with you about the fact that it is probably not going to change anything (it may get worse without write cache)
[10:25:51] Knowing if it would is better than speculating :)
[10:26:33] klausman: about workload/benchmark, we could simply turn one node to writethrough and let it run for few days, and compare it with the rest
[10:26:41] bursts of IOPS happen every now and then
[10:28:14] Is this a kernel module or BIOS option? or can it be done on the fly with a CLi tool?
[10:29:09] the last one, via the megacli tool
[10:29:41] the settings that we use are in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/633941/3/cookbooks/sre/hadoop/init-hadoop-workers.py (see the commands)
[10:29:43] Then we should totally do it. It's little time/energy invested and knowing if it makes a difference would be good.
[10:49:24] ack, will open a task :)
[11:00:39] klausman: ok to schedule a maintenance window for stat1005/stat1008 to reboot them? Friday morning?
[11:00:53] SGTM
[11:01:01] sending the email then
[11:01:26] This si to ensure correct working of the rocm33 DKMS, right?
[11:05:23] exactly yse
[11:05:25] *yes
[11:18:04] hellooo teamm
[11:19:00] Heya!
[11:21:55] Analytics-Clusters, Operations, Traffic, Patch-For-Review: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (elukey) a:klausman→ema Transitioning the task to Ema since he is following up with upstream to patch Varnish :)
[11:23:14] hola marcel
[11:45:51] going afk for a couple of hours, ttl!
[13:25:52] Analytics-Clusters, Operations, Traffic, Patch-For-Review: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (ema) I've opened https://github.com/varnishcache/varnish-cache/issues/3436 for 6.5/master, https://github.com/varnishcache/varnish-cache/issues/3437 for 6....
[13:34:20] Analytics, Platform Team Sprints Board (Sprint 5), Platform Team Workboards (Green): Ingest api-gateway.request events to turnillo - https://phabricator.wikimedia.org/T261002 (eprodromou) I met with @millimetric and we think we can get all the info needed through superset. I'm going to close this tic...
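As a reference for the WriteThrough test discussed above (10:26-10:49): a minimal sketch of how the write-policy flip could look with the MegaCli tool elukey mentioned. The binary name (megacli vs. MegaCli64) and the exact invocations used by the init-hadoop-workers cookbook are assumptions, so treat this as illustrative only:

    # show the current cache policy of every virtual drive on every adapter
    sudo megacli -LDInfo -LAll -aAll | grep -i 'cache policy'
    # switch every virtual drive to WriteThrough (WT); WB flips back to WriteBack
    sudo megacli -LDSetProp WT -LAll -aAll

Applying the second command to a single worker and leaving it in place for a few days, as suggested above, would give a direct writeback-vs-writethrough comparison against the rest of the cluster.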
[13:34:32] Analytics, Platform Team Sprints Board (Sprint 5), Platform Team Workboards (Green): Ingest api-gateway.request events to turnillo - https://phabricator.wikimedia.org/T261002 (eprodromou) Open→Resolved
[13:34:35] Analytics, MediaWiki-REST-API, Platform Team Sprints Board (Sprint 5), Platform Team Workboards (Green), Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (eprodromou)
[14:10:14] Analytics, Analytics-Kanban, Operations, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (mforns) Yes, please merge when ready, thanks!
[14:27:43] Analytics, Analytics-Wikistats: Wikistats 2.0: Add statistics for the geographical origin of the contributors - https://phabricator.wikimedia.org/T188859 (CKoerner_WMF) IANAA (I am not an analyst) so there's little I can contribute to this task, but I wanted to give an example of how this sort of data co...
[14:38:32] Analytics, Analytics-Wikistats: Wikistats 2.0: Add statistics for the geographical origin of the contributors - https://phabricator.wikimedia.org/T188859 (Nuria) @CKoerner_WMF just so you know this data has been publicy available for now about a year, the task in question is to visualize it via Wikistats...
[14:39:24] mforns: i was having yesterday the hardest time executing the entropy oozie job, can you share the command you have executed it with in the past ?
[14:39:27] mforns: i did
[14:39:28] oozie job -run -Duser=nuria -Darchive_directory=hdfs://analytics-hadoop/tmp/nuria -Doozie_directory=/tmp/oozie-nuria/ -config ./oozie/data_quality_stats/bundle.properties -Dgranularity=daily
[14:39:40] with start/end dates on property file
[14:49:20] Analytics-Clusters: Review recurrent Hadoop worker disk saturation events - https://phabricator.wikimedia.org/T265487 (elukey)
[14:51:44] mforns: nvm i see the command is on the readme
[14:54:51] Analytics, Event-Platform, Product-Infrastructure-Data, Patch-For-Review: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data - https://phabricator.wikimedia.org/T263466 (Ottomata) Blocked on discussion in {T263672}
[14:56:10] !log drain + reboot an-worker109[8,9] to pick up GPU settings
[14:56:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:56:21] currently doing it via cookbook --^
[14:56:23] * elukey dances
[15:03:06] nuria: the command in the readme I think executes the whole bundle...
[15:06:52] oh, but that does not matter that much, because the job outputs to a test folder
[15:29:06] !log drain + reboot an-worker110[1,2] to pick up GPU settings
[15:29:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:59:07] !log drain + reboot an-worker1100 to pick up GPU settings
[15:59:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:03:28] ping razzi
[16:16:05] Analytics-Clusters, Analytics-Kanban, Patch-For-Review: Put 6 GPU-based Hadoop worker in service - https://phabricator.wikimedia.org/T255138 (elukey) p:Triage→Medium a:elukey
[17:03:57] Heya team - sorry for lateness :S
[17:04:15] milimetric: there have been a faile refined hour - Do you wish me to retry it?
[17:04:39] joal: I think it's mforns' ops week
[17:05:01] milimetric: on paper it's mine :)
[17:05:13] oh no! I thought it was thursday my bad
[17:05:15] sorry mforns
[17:05:26] joal: wait, why are you asking then
[17:05:32] no problemo!
[17:05:55] hm, actually I can't view new meetings for ops-week from next thursday
[17:06:22] joal: they're all-day events
[17:06:26] milimetric: because of reorg of weeks, on paper it's mine, but I preferd to ask :)
[17:06:34] Ohhhh!
[17:06:39] oh! ok, sorry jo, no problem do restart
[17:06:41] ok get it milimetric - :)
[17:06:58] great - restarting refine for that failed hour
[17:07:11] (I had to make them all-day meetings because otherwise google cal can't do the 4-week cycle starting on Thursday and ending the next Thursday)
[17:07:29] it's great milimetric :)
[17:07:40] Just as usual, a change that I didn't notice :)
[17:08:00] ottomata: what do you think about auto-retrying refine in some clever way, like:
[17:08:08] 1. try one hour later
[17:08:13] 2. if it fails, try with drop malformed
[17:08:31] 3. if that succeeds, check # records refined, if it's more than 95%, proceed
[17:08:50] and only then send the error to us
[17:10:50] hm, maybe milimetric for 2
[17:10:55] we could catch the error in Refine
[17:11:01] and re-refine with drop malfomred
[17:11:07] compare the # records as you say
[17:11:10] and email
[17:11:15] then we don't have to try one hour later
[17:11:51] basically same idea, but just retrying within the same job, instead of scheduling another one
[17:12:00] I'm thinking let an hour pass for any extraneous problems to resolve themselves (network, etc.)
[17:12:12] but otherwise, yeah, sounds good
[17:12:13] well, if we are just catching error for drop malformed
[17:12:16] we could do immediately
[17:12:20] ORF
[17:12:27] yes, that step yeah, I meant the hour wait just initially, not after 2
[17:12:30] we could just always run with drop malformed, as long as we know the count that gets lost
[17:12:39] and send an email on dropped records
[17:13:01] but for other errors, like networking onesyeah
[17:13:07] oh that makes sense too, in general, yeah, just set it to true by default and error out on some threshold
[17:13:18] ok - failure of the job was due to cluster being busy (more than 4 FetchFailed exception) - Cluster seems ok now, will try to rerun
[17:13:30] yeah the main reason we don't dorp malformed by default is so we don't silently drop tons of data
[17:13:35] ^ right, like that just needs a gentle retry some time later
[17:13:59] yeah, I'd say if we do drop more than threshold, still write _REFINE_FAILED and send email
[17:14:02] it might be hard to reason about what needs to be retried manually
[17:14:11] oh i guess we have lucas historical refine_faield checker
[17:14:16] that would be good enough
[17:14:25] Oh and by the way team - the train etherpad wass empty yesterday - I wanted to ask you folks if there was anything to dpeloy but forgot
[17:14:45] Doing today a-team (sorry for the wilde ping) - Last call for deploy, other wise cancelled
[17:14:47] joal: i was hoping to get my camus stuff in but i don't thikn i'll make it
[17:15:06] ack ottomata - This is sensitive - better to fullproof
[17:15:11] ya
[17:15:21] no deploy needs for me joal :]
[17:18:20] ack mforns - thanks :)
[17:34:54] mforns: Would you have a minute? I have a question
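A rough Scala sketch of the retry-with-drop-malformed flow discussed above (17:08-17:14). Every name here (refineHour, RefineResult, writeRefineFailedFlag, sendAlertEmail) is a hypothetical placeholder rather than the real Refine job API; the 95% threshold is the one mentioned in the conversation:

    import scala.util.{Failure, Success, Try}

    case class RefineResult(inputRecords: Long, refinedRecords: Long)

    // Placeholders standing in for the real job's internals.
    def refineHour(hour: String, dropMalformed: Boolean): RefineResult = ???
    def writeRefineFailedFlag(hour: String): Unit = ???
    def sendAlertEmail(hour: String, detail: String): Unit = ???

    def refineWithFallback(hour: String): Unit = {
      // First try strict refinement; if it throws, retry once with malformed-record dropping.
      val attempt: Try[RefineResult] = Try(refineHour(hour, dropMalformed = false))
        .recoverWith { case _: Exception => Try(refineHour(hour, dropMalformed = true)) }

      attempt match {
        case Success(r) =>
          val keptFraction = r.refinedRecords.toDouble / r.inputRecords
          if (keptFraction < 0.95) {
            // Too much data dropped: still mark the hour failed and alert,
            // instead of silently losing records.
            writeRefineFailedFlag(hour)
            sendAlertEmail(hour, f"only ${keptFraction * 100}%.1f%% of records survived refinement")
          }
        case Failure(e) =>
          writeRefineFailedFlag(hour)
          sendAlertEmail(hour, e.getMessage)
      }
    }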
[17:35:02] yesssss joal
[17:35:35] mforns: about rerunning a refine job - I'm following: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Refine#Rerunning_jobs
[17:35:50] aha
[17:35:55] And the command says: sudo -u analytics /usr/local/bin/refine_eventlogging_analytics --ignore_failure_flag=true --table_whitelist_regex=MobileWikiAppShareAFact --since=96 --verbose refine_eventlogging_analytics
[17:36:19] mforns: Now my question is - What is the last `refine_eventlogging_analytics` doing there?
[17:36:36] mmmmmmmm
[17:36:42] uou, I think it's a typo
[17:37:04] That was my guess as well, but I'd prefer double check before
[17:37:24] let me see what I use usually...
[17:37:43] mforns: here is my plan for the rerun: sudo -u analytics /usr/local/bin/refine_event --ignore_failure_flag=true --table_whitelist_regex=mediawiki_api_request --since=24 --verbose
[17:38:03] mforns: thanks for triple checking with me, I'm not used to rerun refined jobs :)
[17:38:16] I usually run it like: sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --table_whitelist_regex=mediawiki_api_request --ignore_failure_flag=true --since='2020-09-25T13:00:00' --until='2020-09-25T14:00:00'
[17:38:41] yep - almost the same except for better dates
[17:38:47] and kerberos
[17:39:01] Ok, will update the doc removing the typo and adding kerberos in there
[17:39:13] thanks for that :]
[17:39:25] !log Rerun refine for mediawiki_api_request failed hour
[17:39:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:40:39] joal: one question also :D
[17:41:09] please
[17:41:37] I'm using maxmind db from a transform function for netflow Refine. Do I need to use caching? Or is that already handled on lower level?
[17:42:11] mforns: are you using the code we defined in refinery-core?
[17:42:25] I'm using ISPDatabaseReader
[17:48:11] mforns: so you use refinery-core
[17:48:29] yes
[17:50:03] I don't think we have caching in place - this is actually a simple improvement (https://maxmind.github.io/MaxMind-DB-Reader-java/ --> caching)
[17:51:08] mforns: --^
[17:51:37] mforns: looks like you can actually provide a patch that would improve all users of MaxMind!
[17:51:49] mforns: from refinery I mean :D
[17:51:50] :O
[17:52:01] ok
[17:52:30] going afk people!
[17:52:39] Bye elukey
[17:53:29] byeeee
[17:54:20] ok joal thanks, we can create a task for that, in the docs it says "Please note that the cache will hold references to the objects created during the lookup. If you mutate the objects, the mutated objects will be returned from the cache on subsequent lookups." We probably do not alter results from maxminddb, but we should check.
[17:54:54] mforns: I'm pretty we don't as we actually rewrite results IIRC - Let's check
[17:55:35] created a task: https://phabricator.wikimedia.org/T265516
[17:55:41] thanks joal!! :]
[17:56:13] Oh no mforns - Thanks to you for that !!!
[18:07:06] ottomata: what do you think of moving the netflow refined data from /wmf/data/wmf/netflow to /wmf/data/event/netflow? This way we could include it in the existing sanitization process?
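A small sketch of the MaxMind node cache mentioned above, using the CHMCache from the linked MaxMind-DB-Reader-java docs through the geoip2 DatabaseReader builder. The database path is illustrative, and how refinery-core's ISPDatabaseReader would actually expose this is an assumption (that is what T265516 is about):

    import java.io.File
    import com.maxmind.db.CHMCache
    import com.maxmind.geoip2.DatabaseReader

    // The cache keeps references to decoded lookup results, so callers must not
    // mutate them (the caveat quoted above from the MaxMind docs).
    val ispReader: DatabaseReader =
      new DatabaseReader.Builder(new File("/usr/share/GeoIP/GeoIP2-ISP.mmdb"))
        .withCache(new CHMCache())
        .build()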
[18:45:57] mforns: makes sense to me
[18:59:29] (PS1) Ottomata: Safeguards when using EventStreamConfig in EtlInputFormat [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609)
[19:04:32] (PS3) Ottomata: Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609)
[19:06:41] joal: when you have time
[19:06:52] https://gerrit.wikimedia.org/r/c/analytics/camus/+/634067
[19:06:52] https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/629377
[19:10:30] ottomata: reading
[19:11:13] (CR) jerkins-bot: [V: -1] Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[19:30:51] (CR) Joal: Safeguards when using EventStreamConfig in EtlInputFormat (2 comments) [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[19:31:38] ottomata: --^ happy to discuss about my comment here - Not commenting on the second CR since it's linked to the comment above
[19:32:26] yeah
[19:32:29] its confusing
[19:32:37] so, Camus's deault is
[19:32:42] if no whitelist or blacklist
[19:32:44] import all topics.
[19:32:47] yup
[19:32:51] CamusPartitionCheckers default is
[19:32:58] if no whitelist or blacklist, check all topics in offset files
[19:33:24] when using EventStreamConfig to get whitelist
[19:33:36] i don't want to go with Camus's default, i want a way to stop camus from proceeding
[19:33:58] but, i want the behavior for CamusPartitionChecker to be: check all topics in offset files
[19:34:46] if no whitelist is found/provided
[19:35:02] does that make sense?
[19:36:03] ottomata: it does, and no whitelist should also be no eventStreamConfig provided, right?
[19:37:05] well, for camus i'm safeguarding against the case where EventStreamConfig finds no topics
[19:37:24] due to misconfiguration of either camus eventstreamconfig.* properties, or to the wgEventStream config in mw config
[19:37:34] since the topics are now dynamic
[19:37:41] i don't want to accidentally schedule a camus job that imports all kafka topics
[19:37:57] so what i'm trying to do is
[19:38:04] My way of viewing it is: in eventStreamConfig mode, we need at least one stream match to proceed
[19:38:07] when uusing evnetstreamconfig, make camus fail if no topics are found
[19:38:18] but allow camuspartionchecker to proceed if no topics are found
[19:38:23] joal: yes, for Camus we do
[19:38:26] but not for CamusPartitionChecker
[19:38:36] But why ottomata ?
[19:39:03] hm, that's the behavior of CamusPartitonChecker now, i guess i didn't wnat to change it: if no whitelist is provided, all previously imported topics ale checked.
[19:39:05] although...
[19:39:08] The config for the partition-checker should be the same as for camus job as it works with the same base data, shouldn'tit?
[19:39:09] i guess, yeah.
[19:39:15] maybe we WANT to stop checking them.
[19:39:23] in that case we should use whatever EventStreamConfig gives us
[19:39:24] hm...
[19:39:35] joal: no
[19:39:45] we are only going to check canary_events topics
[19:39:51] right now
[19:40:01] we only check high throughput topics; ones we know have data in them every hour
[19:40:22] hm but yes.
[19:40:29] we could say
[19:40:40] when using CamusPartitionChecker + EventStreamConfig, if no topics match, fail.
[19:40:44] i guess that's an ok behavior.
[19:40:56] i was trying to keep the CamusPartitionChecker the same
[19:41:09] but yeah, it might be nice to automatically be able to stop checking certain topics
[19:41:11] That's my view as well - If we say that no-topic in eventStream Mode is an error, it should be an error for checker as well
[19:41:24] e.g. if we stop producing canary_events to a stream, we should stop checking its topics
[19:41:31] ok
[19:41:39] yeah joal sounds good i'll do that then, that does make more sense.
[19:41:44] if we stop ingesting a topic, we should stop checking it
[19:41:50] awesome :)
[19:42:17] It'll make the understanding of the code easier as well ottomata :)
[20:04:27] (CR) Ottomata: Safeguards when using EventStreamConfig in EtlInputFormat (2 comments) [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:05:23] (PS2) Ottomata: Safeguards when using EventStreamConfig in EtlInputFormat [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609)
[20:05:41] (PS4) Ottomata: Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609)
[20:06:01] done joal ^^
[20:09:46] (CR) jerkins-bot: [V: -1] Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:12:37] ottomata: about the streamName being null - I understand now having looked into event-utilities that a null streamName means any-stream
[20:13:41] It's not explicit in the way it is handled in the camus code - should it be (either through doc or even log)?
[20:14:50] (PS3) Conniecc1: Add dimensions to editors_daily dataset [analytics/refinery] - https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050)
[20:17:31] hm, joal i think usually it will be null, but i'll add some comments
[20:17:56] oh i did add a comment
[20:18:12] i'll add it to docs on config param
[20:18:39] thanks ottomata - sorry for having missed the comment
[20:19:05] (PS3) Ottomata: Safeguards when using EventStreamConfig in EtlInputFormat [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609)
[20:22:31] (CR) Joal: "LGTM - Thanks ottomata for the updates" (2 comments) [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:28:12] (CR) Ottomata: [V: +2 C: +2] Safeguards when using EventStreamConfig in EtlInputFormat [analytics/camus] (wmf) - https://gerrit.wikimedia.org/r/634067 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:29:27] (CR) Joal: [C: +1] "Code looks good - I've not followed jar versions but I trust you" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[20:33:46] (CR) Ottomata: "recheck" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
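A hypothetical Scala sketch of the behaviour agreed on above: in EventStreamConfig mode, an empty topic list is treated as a configuration error for both the Camus job and CamusPartitionChecker, rather than falling back to importing/checking every topic. The helper name and stream-name plumbing below are placeholders, not code from either patch:

    // Placeholder for whatever EventStreamConfig lookup the job is configured with.
    def topicsFromEventStreamConfig(streamNames: Seq[String]): Seq[String] = ???

    val configuredStreamNames: Seq[String] = Seq.empty // from the job properties (placeholder)
    val topicWhitelist: Seq[String] = topicsFromEventStreamConfig(configuredStreamNames)

    if (topicWhitelist.isEmpty) {
      // Fail fast: misconfigured eventstreamconfig.* properties or MediaWiki stream
      // config should never silently turn into "consume/check all Kafka topics".
      throw new IllegalStateException("EventStreamConfig matched no topics; refusing to proceed.")
    }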