[00:36:43] hmm, so it looks like what joal did was pass the explicit partitioned path into the job, write it as a plain directory, and then i'm not seeing where but i remember him mentioning before to ask hive to re-calculate table partitions by looking at the disk [00:38:07] probably via repair_partitions.hql [00:45:23] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, looks like a result of possibly malicious activity - https://phabricator.wikimedia.org/T144100 (10awight) This is pedantic of me, but I want to break the subtask link to T152628 and make it a "me... [00:45:58] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10awight) [00:48:46] 10Analytics, 10Analytics-Cluster: Upgrade Hive to ≥ 2.0 - https://phabricator.wikimedia.org/T203498 (10Neil_P._Quinn_WMF) [01:10:29] 10Analytics, 10Analytics-Kanban: Provide edit tags in the Data Lake edit data - https://phabricator.wikimedia.org/T161149 (10Neil_P._Quinn_WMF) Just FYI: not having this has created a slight problem with the February board metrics (T218055), since I could no longer use the change tags from dbstore1002 and I ha... [01:18:59] 10Analytics, 10Analytics-Kanban: Provide edit tags in the Data Lake edit data - https://phabricator.wikimedia.org/T161149 (10Nuria) Change tag tables (not as part of mw history) are in scooped in hadoop. See: select count(*) from mediawiki_change_tag where wiki_db="eswiki" and snapshot="2019-02"; So even if... 
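An editorial aside on the partition-repair step described above: after writing data straight into a partitioned directory, Hive can be asked to rediscover partitions from disk with `MSCK REPAIR TABLE` (presumably what repair_partitions.hql wraps), or a single known directory can be registered explicitly with `ALTER TABLE ... ADD PARTITION`. A minimal sketch of generating those statements — the table name and paths below are illustrative, not taken from the log:

```python
def repair_statement(table: str) -> str:
    """Ask Hive to re-scan the table location on disk and register any
    partition directories it finds there."""
    return f"MSCK REPAIR TABLE {table};"


def add_partition_statement(table: str, partition: dict, location: str) -> str:
    """Explicit alternative: register one known partition directory."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}';"
    )


# Hypothetical example values, for illustration only:
print(repair_statement("wmf.pageview_hourly"))
print(add_partition_statement(
    "wmf.pageview_hourly",
    {"year": "2019", "month": "3", "day": "19", "hour": "0"},
    "/wmf/data/wmf/pageview/hourly/year=2019/month=3/day=19/hour=0",
))
```

`MSCK REPAIR TABLE` is convenient when many partitions were written at once; the explicit `ADD PARTITION` form is cheaper when the job already knows the single path it wrote.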
[01:23:29] 10Analytics, 10Analytics-Cluster: Upgrade Hive to ≥ 2.0 - https://phabricator.wikimedia.org/T203498 (10Tbayer) [01:37:40] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10awight) In https://phabricator.wikimedia.org/diffusion/ANRE/browse/master/oozie/pageview/hourly/transform_pageview... [02:26:57] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-API, 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), and 2 others: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 (10Krinkle) 05Open→03Resolved [05:26:34] (03PS8) 10Mill: ibaaaaaaaaaaaa [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/494831 (owner: 10Ottomata) [05:26:41] (03PS3) 10Mill: %26baaaaaaaaaaaa [analytics/refinery] - 10https://gerrit.wikimedia.org/r/496885 (owner: 10Bmansurov) [05:26:53] (03PS12) 10Mill: pbaaaaaaaaaaaa [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/492399 (owner: 10Ottomata) [05:36:37] (03PS5) 10Mill: ciaaaaaaaaaaaa [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/494241 (owner: 10Fdans) [05:54:28] (03PS5) 10Mill: fpaaaaaaaaaaaa [analytics/refinery] - 10https://gerrit.wikimedia.org/r/484250 (owner: 10Mforns) [05:57:32] (03PS5) 10Mill: fraaaaaaaaaaaa [analytics/refinery] - 10https://gerrit.wikimedia.org/r/491791 (owner: 10Elukey) [07:09:43] the above spam from wikibugs is sadly vandalism [07:22:58] I am about to run errand for passport renewal, ttl! 
[07:53:46] later elukey [07:58:40] Hi ebernhardson - You're absolutely right in your understanding of how hive partitions are inserted when Spark doesn't manage to do it by itself :) [08:24:23] 10Analytics, 10Product-Analytics, 10MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), 10Patch-For-Review: Standardize datetimes/timestamps in the Data Lake - https://phabricator.wikimedia.org/T212529 (10JAllemandou) Absolutely right @ottomata - We wanted to use Timestamps type to facilitate applying functions, bu... [08:38:00] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10JAllemandou) thanks for working on this @awight :) >>! In T144100#5034702, @awight wrote: > Maybe it's desirabl... [08:47:08] joal: bonjour! [08:47:24] I have a couple of mins to drop a question for you, you can answer whenever you have time :) [08:47:41] I have been trying to increase logging for mapreduce jobs on the testing cluster [08:47:59] the oozie/hive webrequest-load-test steps are still failing for TLS issues [08:48:12] there are useful debugging directives like https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java#L445 [08:48:17] that I'd love to print [08:48:41] but after following a ton of suggestions here and there I didn't find anything working [08:48:59] I guess Fetcher.java needs a change in log4j [08:49:22] hm [08:49:25] there is one in /etc/hadoop/conf but IIRC yesterday when I tried it wasn't working [08:49:28] I'll have allok :) [08:49:34] thanks :) [08:49:54] don't spend too much time, I'll do the research, I was only asking since you usually know these tricky questions :) [08:50:04] wow - allok - Seems today is my new-words day :) [08:50:23] elukey: I don't know per se, but I can try :) [08:50:48] elukey: to
be sure - the job you're looking at is a hive job, tight? [08:50:51] right? [08:51:04] it is a hive action of an oozie job yes [08:51:16] andrew yesterday suggested https://oozie.apache.org/docs/3.3.1/DG_HiveActionExtension.html [08:51:35] but I haven't tried it yet [08:52:03] elukey: my first step will be to run the hive query manually, and try to log from there [08:52:36] joal: this is a very good point, I have always re-run the oozie job [08:52:52] the step that fails is generate_sequence_statistics [08:53:30] if I run the query via hive and it fails as well, then I can only work on that bit [08:53:45] * elukey takes notes [08:53:50] this is a very good suggestion [08:54:29] my suspicion is that, for some reason, the reduce step tries to fetch from the shuffler via HTTP [08:54:32] not https [08:54:43] ok [08:55:21] all right will try in ~2h [08:55:31] going to get my passport (hopefully) [08:55:32] :) [08:55:36] thanks joal! [08:55:44] elukey: my assumption is that logger parameterization is done using log4j files, and the more layers of possible existence/overwriting of that file, the harder to get it to do what you want [08:55:47] when I am back if you have time we can work on sqoop [08:56:00] elukey: sqoop succeeded this weekend :) [08:56:10] ah yes but in 18h right? [08:56:20] elukey: I'm gonna patch my code based on Dan and Nuria CR, and that should do [08:56:20] I thought it was a very gentle try [08:56:37] correct elukey - We can do faster, but I think it's not urgent as of now [08:56:54] sure sure, I meant simply increasing mappers etc..
[08:56:57] nothing super fancy [08:57:04] anyway, good for me :) [08:57:06] yup, we can discuss this :) [08:57:13] if you need my hel I'll be available [08:57:15] Good luck with the passport [08:57:16] *help [08:57:21] ahahha thanks [08:57:50] ah Gerrit is still down for maintenance as FYI [08:57:59] yup [08:59:35] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10awight) Reading [[ https://issues.apache.org/jira/browse/HIVE-5672 | HIVE-5672 ]], the root bug has been fixed and... [09:02:08] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10JAllemandou) > Reading [[ https://issues.apache.org/jira/browse/HIVE-5672 | HIVE-5672 ]], the root bug has been fi... [09:02:38] 10Analytics: Review parent task for any potential pageview definition improvements - https://phabricator.wikimedia.org/T156656 (10awight) @Milimetric Would you mind pointing me to the definition this task will update? If there are formatting changes to how fields are delimited and escaped, we will need to find... [09:04:32] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10awight) >>! In T144100#5035139, @JAllemandou wrote: >> Reading [[ https://issues.apache.org/jira/browse/HIVE-5672... [11:24:32] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-API, 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), and 2 others: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 (10zeljkofilipin) @Krinkle the commit is merged in wmf.21, but not deployed? Is... [12:47:37] back :) [12:47:56] passported elukey, or not yet? 
[12:48:08] not yet, will take ~45d [12:48:21] \/o\ [12:49:21] this was only the request paperwork :) [12:49:36] ok [12:50:28] If it works as in France, you'll normally get notified in ~40 days that your paperwork is incomplete, and then you'll need formular E308-Z-302 to be filled before waiting another 45 days ;) [12:57:16] ah! So France and Italy are not that different :D [12:58:53] :) [12:59:06] elukey: please let me know if I can help with the logging for hive [12:59:43] sure! I am going to start in a bit sending the hive command via hive -f and see if I can repro [13:04:49] heya joal yt? [13:04:57] yessir :0 [13:05:00] :) [13:05:06] how are you ottomata ? [13:05:16] am well! blocked by this gerrit outage :/ [13:05:21] so am working on refine schema stuff :) [13:05:29] joal wanted to ask your opinion [13:05:41] nuria and I were having another java code debate yesterday :) [13:05:59] so e.g. in that EventLoggingSchemaLoader class [13:06:04] i have a bunch of public methods [13:06:33] the only one that EventSparkSchemaLoader uses is the getEventSchema(JsonNode event) or getEventSchema(String jsonEvent) [13:06:41] but i also have things like [13:06:49] getEventLoggingSchema(String schemaName) [13:06:51] or [13:06:57] encapsulateEventLoggingSchema [13:06:57] etc. [13:07:10] nuria wanted only getEventSchema to be public [13:07:37] i want them all (or most) to be public for A. ease of unit testing, and B. to be able to use them on the spark repl when debugging [13:07:41] and troubleshooting refine problems [13:07:58] I didn't convince her and she didn't convince me [13:07:59] so :) [13:08:12] we are at an impasse [13:08:31] Argh :) [13:08:41] so now I'm making you be King Solomon to pass your impartial judgement :p [13:08:59] * joal wonders if being caught in an impasse between nuria and ottomata is a good position [13:09:10] hahah i'm sure it isn't! [13:09:14] also she's not awake yet! :D [13:09:25] :) [13:09:29] so maaybe not fair!
[13:09:42] ah i just want to work on stuff but gerrit is really pretty necessary eh?! [13:10:14] ottomata: I can see light at both your ends of the tunnel - Now the question is which one to pick, and obviously, why [13:11:11] one of my arguments was that I'd be the one debugging the stuff when it broke, and being able to call the individual methods in the repl is very handy for that [13:11:24] nuria said all that could be covered with good logging [13:11:42] which is true, if your logging covers all cases, which it never does [13:11:53] so i don't want to make a code change and compile in order to test [13:11:58] to add logging [13:12:17] nuria's reason for not wanting public methods is to keep public API minimal [13:12:19] which is a good reason [13:12:43] but I think not important in refinery-source, especially as a standalone Event/EventLogging schema class [13:12:53] i think its a helper class and various helper methods are useful [13:16:43] ottomata: Could we add a separate helper class whose job is actually to handle such public methods? [13:17:52] hm, i guess, but why? [13:18:12] wouldn't that just move the problem?
[13:18:29] ottomata: gerrit back up :) [13:18:33] oh goodie [13:19:20] in some ways yes, in terms of software organization however, it makes your main class responsible for its only job, and other functions maintained in other places [13:19:44] hm the class uses those functions [13:19:52] they are used, but just by the class itself [13:19:53] IMO simple APIs are to be thought of in the scope of objects themselves [13:19:57] only a couple are called directly from refine [13:20:01] With the single-resp principle [13:20:12] yes....ok now i can link class [13:20:21] https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/492399/11/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/jsonschema/EventLoggingSchemaLoader.java [13:20:50] ottomata, joal - as an FYI today there is network maintenance for row A - https://phabricator.wikimedia.org/T187960 [13:20:58] it will likely result in some errors etc.. [13:21:04] will let you know when it starts [13:21:09] ok [13:21:10] ok - some workers will get outages I guess [13:21:21] elukey: does gerrit not have 2FA? [13:21:30] sadly yes [13:22:17] elukey: spam issue was related to account pwned? [13:22:25] joal maybe i can make just a few of the methods protected there, e.g.
encapsulateEventLoggingSchema and buildEventLoggingCapsule and it will make nuria happier [13:22:32] joal, not the spam issue iiuc [13:22:37] the one from this morning/last night [13:22:45] but, i just had to log in again and was thinking about it [13:22:54] right [13:23:06] joal: I think that the SRE team will send a detailed post mortem very soon [13:23:11] i'm going to make a couple of these methods protected and see if i can squeeze a +1 out of nuria later :p [13:23:33] ottomata: issue about protected is it doesn't solve your spark-repl usage [13:23:45] well, these couple of methods i don't worry about so much [13:23:48] its the other ones i want [13:24:00] i want to be able to get the schema from a string or a jsonnode, or a uri [13:24:04] or a schema name [13:24:14] ottomata: if they're useful in some ways, having their own life and places make sense? [13:24:22] i probably won't need to call e.g. encapsulate() directly [13:24:36] joal: ya but I think they belong here, no? [13:25:04] this class already extends EventSchemaLoader [13:25:11] and separates out the eventlogging specific stuff [13:26:20] the question is whether the separated stuff is to be visible or not [13:26:39] If it is, it needs its own space (not shared with a schema-loader) [13:28:45] hm joal . ok so you are saying that I should make a wrapper class for EventLoggingSchemaLoader? [13:28:53] and put the non-overriding methods in it? [13:29:04] e.g. where does this belong?
[13:29:08] ottomata: small typos in logs on that file (lines 202 and 225) [13:29:09] https://www.irccloud.com/pastebin/OCThso4W/ [13:29:24] (thanks) [13:30:48] Really the eventLoggingSchemaUriFor methods could be made private/protected [13:30:56] Same for encapsulateEventLoggingSchema [13:31:27] Except if we say we need them for other purposes, and in that case they need their spaces [13:31:30] IMO [13:32:07] those methods don't belong in the schema-loader in terms of its public API [13:32:11] alright, i will make some protected and see how it goes [13:32:16] k [13:32:24] i want getEventLoggingSchema to be public though [13:32:37] Trying to push toward a solution that would make both of you happy [13:32:43] this is why i asked you :) [13:33:06] :) [13:34:18] You could have an EventLoggingURIBuilder owning the URI stuff, and make encapsulateEventLoggingSchema private [13:34:59] yar, wish I had LogHelper in Java :p [13:44:01] joal: [13:44:02] 2019-03-19 13:39:18,516 DEBUG [fetcher#5] org.apache.hadoop.mapreduce.task.reduce.Fetcher: MapOutput URL for analytics1031.eqiad.wmnet:13562 -> http://analytic [13:44:05] \o/ [13:44:08] s1031.eqiad.wmnet:13562/mapOutput?job=job_1552661428200_0762&reduce=0&map=attempt_1552661428200_0762_m_000001_0 [13:44:11] it is using http indeed [13:44:29] elukey: you have managed to log I guess?
[13:44:34] I had to add the following to the hive script [13:44:34] SET mapreduce.map.log.level=DEBUG; [13:44:34] SET mapreduce.reduce.log.level=DEBUG; [13:44:35] SET yarn.app.mapreduce.am.log.level=DEBUG; [13:44:39] and then all good [13:44:44] Awesome :) [13:44:49] NICE [13:45:13] now I got the confirmation of my suspicions, but of course remains to understand the why sigh [13:45:19] * elukey digs into logs [13:47:05] (03PS13) 10Ottomata: Event(Logging) schema loader [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/492399 (https://phabricator.wikimedia.org/T215442) [13:48:46] (03CR) 10Ottomata: Event(Logging) schema loader (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/492399 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata) [13:49:02] joal: ya could, man there are just way too many classes and files in Java sometimes. [13:49:08] anyway, i made the non-loading methods protected [13:49:25] https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/492399/13/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/jsonschema/EventLoggingSchemaLoader.java [13:51:45] k ottomata [14:15:32] 10Analytics, 10Core Platform Team (Modern Event Platform (TEC2)): Ingest api data (for posts) into druid - https://phabricator.wikimedia.org/T218348 (10kchapman) [14:16:16] 10Analytics, 10Core Platform Team Backlog, 10Core Platform Team (Modern Event Platform (TEC2)): Ingest api data (for posts) into druid - https://phabricator.wikimedia.org/T218348 (10kchapman) [14:17:31] 10Analytics, 10Core Platform Team (Modern Event Platform (TEC2)), 10Core Platform Team Backlog (Watching / External): Ingest api data (for posts) into druid - https://phabricator.wikimedia.org/T218348 (10kchapman) [14:25:40] 10Analytics, 10Product-Analytics, 10Core Platform Team Kanban (Done with CPT), 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), 10Services (done): `rev_parent_id` and `rev_content_changed` are missing in
event.mediawiki_revision_tags_change - https://phabricator.wikimedia.org/T218274 (10Pchelolo) 05Open→0... [14:38:09] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10Ottomata) [14:45:15] so now I can see [14:45:16] 2019-03-19 14:32:38,805 DEBUG [fetcher#2] org.apache.hadoop.mapreduce.task.reduce.Fetcher: MapOutput URL for analytics1032.eqiad.wmnet:13562 -> https://analytics1032.eqiad.wmnet:13562/mapOutput?job=job_ [14:45:20] etc.. [14:45:23] that is better [14:45:32] but for some reason the job gets stuck anyway :D [14:52:39] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-API, 10MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), and 2 others: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 (10Jdforrester-WMF) Deployed, but not up with the on-wiki SAL; they appear in t... [14:52:53] 10Analytics, 10Analytics-EventLogging, 10MediaWiki-API, 10MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), 10good first bug: ApiJsonSchema implements ApiBase::getCustomPrinter for no good reason - https://phabricator.wikimedia.org/T91454 (10Jdforrester-WMF) [14:55:07] 10Analytics, 10Scoring-platform-team (Current): [Discuss] ORES model development and deployment processes - https://phabricator.wikimedia.org/T216246 (10Halfak) a:05Halfak→03None This discussion seems to be stalled. I'm not sure that it should be assigned to me. @nuria, did you have any specific goals yo... 
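For the record, the knob that governs whether reducers fetch map output over HTTPS is Hadoop's encrypted-shuffle setting in mapred-site.xml. A sketch of the fragment involved — that this exact property was what changed here is an assumption; the log only shows the MapOutput URL scheme flipping from http to https:

```xml
<!-- Hadoop encrypted shuffle: reducers fetch map output over HTTPS.
     Assumed to be the relevant setting; not confirmed in the log. -->
<property>
  <name>mapreduce.shuffle.ssl.enabled</name>
  <value>true</value>
</property>
```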
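The log-level fix elukey found in the exchange above (the three SET ... log.level directives) has to be prepended to the HQL before it is run with hive -f. A tiny helper sketching that step — the property names are verbatim from the log, while the function name and example query are made up:

```python
# Debug directives elukey added to the hive script (verbatim from the log).
DEBUG_DIRECTIVES = (
    "SET mapreduce.map.log.level=DEBUG;",
    "SET mapreduce.reduce.log.level=DEBUG;",
    "SET yarn.app.mapreduce.am.log.level=DEBUG;",
)


def with_debug_logging(hql: str) -> str:
    """Prepend the MapReduce debug-logging directives to an HQL query,
    ready to be written to a file and run with `hive -f`."""
    return "\n".join(DEBUG_DIRECTIVES) + "\n" + hql


print(with_debug_logging("SELECT 1;"))
```

Setting the levels per-query like this avoids touching the cluster-wide log4j files, which, as joal notes above, are layered and easy to get wrong.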
[14:59:25] (03PS1) 10Bearloga: Add whitelisting for mobile app Suggested Edits schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/497499 (https://phabricator.wikimedia.org/T218594) [15:05:02] network maintenance starting [15:05:17] an-master1001 is in the group, I'll do a failover later on [15:09:08] 10Analytics, 10Scoring-platform-team (Current): [Discuss] ORES model development and deployment processes - https://phabricator.wikimedia.org/T216246 (10Nuria) Rather than close you can move to blocked and leave it open , i do not think anything is happening in the near future. [15:09:16] 10Analytics, 10Scoring-platform-team (Current): [Discuss] ORES model development and deployment processes - https://phabricator.wikimedia.org/T216246 (10Nuria) 05Open→03Stalled [15:09:57] 10Analytics, 10Scoring-platform-team (Current): [Discuss] ORES model development and deployment processes - https://phabricator.wikimedia.org/T216246 (10Nuria) Moved to ML from radar column. [15:13:30] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10Nuria) >I think the suggestion of escaping all white-space characters (end-of-lines, spaces, tabs etc) actually ma... [15:14:23] might i need to change anything about our refinery-drop-hive-partitions jobs? The ones running in beta cluster email me regularly about kerberos problems and failing [15:17:42] ebernhardson: beta cluster? [15:17:47] you mean the analytics project? [15:17:48] elukey: cloud? [15:17:59] I didn't know that you were receiving emails for those :( [15:18:12] "Beta" for me is deployment-prep [15:18:15] this is why I was asking [15:18:28] well, that's probably true.
But i guess i've called the prod system in cloud beta :P [15:19:33] but basically yes, whatever you set up to test kerberos in beta cluster, i'm getting alert emails from hdfs@hadoop-coordinator-2, i think the important part of error is: [15:19:36] javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hadoop-coordinator-2.analytics.eqiad.wmflabs/172.16.2.237"; destination host is: "hadoop-master-4.analytics.eqiad.wmflabs":8020 [15:20:04] yes yes in there the work is still wip, I am working now on the test prod cluster [15:20:17] ok, i've mostly been ignoring those emails so will continue to do so :) [15:20:27] super, sorry for the noise [15:21:06] no worries [15:23:43] ebernhardson: totally unrelated qs - ok for the GPU then right? do we need to check other things like specific height/width/etc.. before buying from newegg? [15:24:51] elukey: i'm pretty comfortable that the OEM WX9100 from AMD is the same dimensions as what's already in there, within probably 1% or so. It looks like it should all fit [15:25:17] (03CR) 10Nuria: [C: 03+2] Add whitelisting for mobile app Suggested Edits schema [analytics/refinery] - 10https://gerrit.wikimedia.org/r/497499 (https://phabricator.wikimedia.org/T218594) (owner: 10Bearloga) [15:25:22] it won't have that fancy rear bracket that came from dell, not sure how important that is. Might be a shipping concern [15:25:37] maybe it can be moved across between cards not sure [15:25:55] (the bracket is just an L that attaches to end of card and it looks like the dell case has a little pin or whatever that keeps it positioned from the rear) [15:26:21] dell, or hp?
whichever vendor the server came from :) [15:28:36] dell :) [15:40:42] doing the failover now [15:40:55] the main problem seems to be yarn not getting active on an-master1002 [15:48:19] wow something weird happened, I had to kill yarn on an-master1001 [15:49:31] oh my I just realized that this is the first restart after the change from zk to hdfs [15:50:35] PROBLEM - Hadoop NodeManager on analytics1059 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:50:51] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:50:52] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:01] PROBLEM - Hadoop NodeManager on analytics1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:01] PROBLEM - Hadoop NodeManager on an-worker1086 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:05] PROBLEM - Hadoop NodeManager on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:06] yeah this is me [15:51:07] PROBLEM - Hadoop NodeManager on an-worker1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:07] PROBLEM - Hadoop NodeManager on analytics1046 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:08] PROBLEM - Hadoop NodeManager on analytics1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:09] PROBLEM - Hadoop NodeManager on an-worker1084 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1051 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:11] PROBLEM - Hadoop NodeManager on analytics1066 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:12] PROBLEM - Hadoop NodeManager on analytics1075 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:13] PROBLEM - Hadoop NodeManager on analytics1076 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:16] PROBLEM - Hadoop NodeManager on analytics1053 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:17] PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:19] PROBLEM - Hadoop NodeManager on analytics1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:19] PROBLEM - Hadoop NodeManager on analytics1052 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:21] PROBLEM - Hadoop NodeManager on analytics1045 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:23] PROBLEM - Hadoop NodeManager on analytics1056 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:24] PROBLEM - Hadoop NodeManager on an-worker1078 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:25] PROBLEM - Hadoop NodeManager on analytics1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:25] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:27] PROBLEM - Hadoop NodeManager on analytics1060 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:29] PROBLEM - Hadoop NodeManager on an-worker1079 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:35] PROBLEM - Hadoop NodeManager on analytics1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:45] PROBLEM - Hadoop NodeManager on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:57] PROBLEM - Hadoop NodeManager on an-worker1082 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:51:57] PROBLEM - Hadoop NodeManager on analytics1061 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:01] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:07] PROBLEM - Hadoop NodeManager on an-worker1091 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:52:45] PROBLEM - Hadoop NodeManager on an-worker1095 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:53:36] so the yarn master for some reason is down and the nodemanager didn't like it [15:53:38] aahhhhhh!!!! [15:53:59] a-team: remember standup is in 30 mins [15:54:03] PROBLEM - YARN NodeManager Node-State on an-worker1080 is CRITICAL: CRITICAL: YARN NodeManager an-worker1080.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:01 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:03] PROBLEM - YARN NodeManager Node-State on an-worker1089 is CRITICAL: CRITICAL: YARN NodeManager an-worker1089.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:01 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:04] PROBLEM - YARN NodeManager Node-State on analytics1071 is CRITICAL: CRITICAL: YARN NodeManager analytics1071.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:54:04] PROBLEM - YARN NodeManager Node-State on analytics1075 is CRITICAL: CRITICAL: YARN NodeManager analytics1075.eqiad.wmnet:8041 Node-State: 19/03/19 15:54:02 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:55:56] PROBLEM - YARN NodeManager Node-State on analytics1050 is CRITICAL: CRITICAL: YARN NodeManager analytics1050.eqiad.wmnet:8041 Node-State: 19/03/19 15:55:55 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to an-master1002-eqiad-wmnet [15:56:22] PROBLEM - Hadoop NodeManager on analytics1057 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:56:46] so I think we should get recovery in a bit [15:57:01] the failover for some reason was super slow [15:57:46] RECOVERY - YARN NodeManager Node-State on an-worker1079 is OK: OK: YARN NodeManager an-worker1079.eqiad.wmnet:8041 Node-State: RUNNING [15:57:50] RECOVERY - Hadoop NodeManager on analytics1045 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:52] RECOVERY - Hadoop NodeManager on analytics1056 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:54] RECOVERY - Hadoop NodeManager on an-worker1078 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:57:56] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:00] RECOVERY - Hadoop NodeManager on an-worker1079 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:07] RECOVERY - Hadoop NodeManager on analytics1043 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:12] sigh [15:58:18] RECOVERY - YARN NodeManager Node-State on analytics1052 is OK: OK: YARN NodeManager analytics1052.eqiad.wmnet:8041 Node-State: RUNNING [15:58:28] RECOVERY - Hadoop NodeManager on an-worker1082 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:28] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:30] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:58:32] 
RECOVERY - YARN NodeManager Node-State on analytics1077 is OK: OK: YARN NodeManager analytics1077.eqiad.wmnet:8041 Node-State: RUNNING
[15:58:32] RECOVERY - YARN NodeManager Node-State on analytics1071 is OK: OK: YARN NodeManager analytics1071.eqiad.wmnet:8041 Node-State: RUNNING
[15:58:33] RECOVERY - YARN NodeManager Node-State on an-worker1080 is OK: OK: YARN NodeManager an-worker1080.eqiad.wmnet:8041 Node-State: RUNNING
[15:58:33] RECOVERY - YARN NodeManager Node-State on an-worker1081 is OK: OK: YARN NodeManager an-worker1081.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:47] I have no idea why this mess didn't happen the last time
[15:59:50] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[15:59:51] RECOVERY - YARN NodeManager Node-State on analytics1070 is OK: OK: YARN NodeManager analytics1070.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:51] RECOVERY - YARN NodeManager Node-State on an-worker1091 is OK: OK: YARN NodeManager an-worker1091.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:52] RECOVERY - YARN NodeManager Node-State on an-worker1086 is OK: OK: YARN NodeManager an-worker1086.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:53] RECOVERY - YARN NodeManager Node-State on analytics1051 is OK: OK: YARN NodeManager analytics1051.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:54] RECOVERY - YARN NodeManager Node-State on analytics1064 is OK: OK: YARN NodeManager analytics1064.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:55] RECOVERY - YARN NodeManager Node-State on analytics1075 is OK: OK: YARN NodeManager analytics1075.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:57] RECOVERY - YARN NodeManager Node-State on an-worker1084 is OK: OK: YARN NodeManager an-worker1084.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:58] RECOVERY - YARN NodeManager Node-State on analytics1049 is OK: OK: YARN NodeManager analytics1049.eqiad.wmnet:8041 Node-State: RUNNING
[15:59:58] RECOVERY -
YARN NodeManager Node-State on an-worker1092 is OK: OK: YARN NodeManager an-worker1092.eqiad.wmnet:8041 Node-State: RUNNING
[16:01:18] a-team: yarn recovered, hdfs RM storage is probably at fault
[16:01:33] it took a loong time for the yarn rm to recover
[16:01:36] like minutes
[16:01:48] I have no explanation for that
[16:02:02] the data on hdfs shouldn't be that big
[16:02:11] I have restarted the hdfs namenode before that
[16:02:27] but it was already failed over to an-master1002
[16:03:35] eh?
[16:03:44] RM storage?
[16:04:00] yes the Yarn RM state
[16:04:11] all the application statuses etc..
[16:04:16] that are now on HDFS
[16:04:21] (before on zk)
[16:04:29] for some reason it took a while to load that
[16:04:42] and the yarn RM on an-master1002 took minutes
[16:04:48] ohhh
[16:04:50] interesting
[16:04:51] so all the yarn nodemanagers shut down
[16:04:57] huh
[16:05:02] because no active master was available
[16:05:23] hm, is that then maybe a reason not to have the state on hdfs?
[16:05:24] I am following the network maintenance now so I can't check logs in depth
[16:05:28] ok
[16:05:39] it could be, but minutes to load some MB?
[16:05:42] 10Analytics, 10Knowledge-Integrity, 10Research, 10Epic, 10Patch-For-Review: Citation Usage: run third round of data collection - https://phabricator.wikimedia.org/T213969 (10bmansurov) Talked to Miriam, and she made an announcement today. We'll wait two days and deploy on Thursday if everything is fine b...
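[Editor's note] The HDFS-vs-ZooKeeper choice for the RM state being discussed above is controlled by `yarn.resourcemanager.store.class` in yarn-site.xml. A minimal sketch of "going back to zookeeper", using the stock Hadoop 2.x class names; the actual cluster's ZK quorum and any extra tuning are assumptions here, not taken from the log:

```xml
<!-- Sketch only: select the YARN RM state-store backend.
     The HDFS-backed store (the one that took minutes to recover) is
     org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <!-- hypothetical quorum, not the real hosts -->
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```

With the ZK store the RM keeps its application state in ZooKeeper znodes instead of per-application HDFS directories, which also removes the circular dependency (YARN recovery needing a healthy, writable HDFS) that bit here.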
[16:11:48] 2019-03-19 15:47:38,646 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded RM state version info 1.2
[16:11:51] 2019-03-19 15:52:59,314 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up
[16:11:54] 2019-03-19 15:53:58,860 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Done loading applications from FS state store
[16:11:57] 15:47 -> 15:53
[16:12:04] that is insane
[16:12:22] 2019-03-19 15:53:58,895 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: recovering RMDelegationTokenSecretManager.
[16:12:25] 2019-03-19 15:53:58,957 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 10012 applications
[16:12:35] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10mobrovac) This should be doable by adding `x-amples` to the service's spec. While technically it won't achieve exactly that, having a POST example that sends a test event will...
[16:13:18] elukey@an-master1002:~$ hdfs dfs -du -h -s /user/yarn/rmstore
[16:13:18] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
[16:13:18] 137.8 M  413.4 M  /user/yarn/rmstore
[16:13:29] so 6 minutes for 137M?
[16:16:58] so now I am wondering if this is due to the namenode being failed over
[16:22:59] 10Analytics, 10Knowledge-Integrity, 10Research, 10Epic, 10Patch-For-Review: Citation Usage: run third round of data collection - https://phabricator.wikimedia.org/T213969 (10Miriam) Yes, announcement just posted!
[16:31:42] a-team: standup, ottomata
[16:31:53] a-team: not sure if milimetric can make it today
[16:32:01] OO coming
[16:32:01] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10Ottomata) Good idea! The test x-amples are there.
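[Editor's note] The numbers pasted above can be sanity-checked with simple arithmetic (all figures are taken from the log excerpts; the "2 RPCs per app" interpretation is an assumption):

```python
from datetime import datetime

# Timestamps and sizes from the RM log excerpts and the hdfs du above.
start = datetime(2019, 3, 19, 15, 47, 38)  # "Loaded RM state version info"
end = datetime(2019, 3, 19, 15, 53, 58)    # "Done loading applications from FS state store"
apps = 10012                               # "Recovering 10012 applications"
size_mb = 137.8                            # hdfs dfs -du -s /user/yarn/rmstore

elapsed = (end - start).total_seconds()    # total recovery window in seconds
throughput_mb_s = size_mb / elapsed        # effective read rate
ms_per_app = elapsed / apps * 1000         # average cost per application

# The effective rate is far below what HDFS can stream, so the time is
# plausibly dominated by per-application metadata RPCs, not data volume.
print(elapsed, round(throughput_mb_s, 2), round(ms_per_app, 1))
```

This works out to roughly 380 seconds, about 0.36 MB/s, and about 38 ms per application, which points at per-directory round-trips rather than bandwidth.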
We should add a custom spec for the wikimedia-eventgate implementation with our x-ample event.
[16:32:42] yeah sorry Nuria, baby’s not asleep yet
[16:32:56] I’ll check any ops week stuff when she naps
[16:58:41] elukey: btw, you made this varnishkafka dashboard, ya?
[16:59:02] https://grafana.wikimedia.org/dashboard/db/varnishkafka - this one?
[16:59:05] ya
[17:02:45] Hey team - back from kids
[17:02:53] elukey: anything I could help with related to the yarn failure?
[17:03:47] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10akosiaris) The readiness probe can't really be POST. The ref is here https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#probe-v1-core, it only allows `httpGet...
[17:04:02] joal: need to check in the logs, I think that the safe option is to go back to zookeeper
[17:04:12] :(
[17:04:14] hm
[17:04:47] elukey: could we do some trial on Thursday to see if we could repro?
[17:06:06] fwiw yarn.wikimedia.org is redirecting to an-master1002.eqiad.wmnet, so basically not working unless you have a SOCKS proxy setup
[17:06:12] yeah
[17:06:31] joal: if you have time I'd like to only restart the yarn RM on an-master1002
[17:06:34] and see how it goes
[17:07:07] I'm here elukey
[17:07:13] so what I know is that it took ~7 mins for the yarn RM on an-master1002 to pick up ~130M of HDFS status
[17:07:17] that is insane
[17:07:17] Please tell how you wanna proceed :)
[17:07:34] and I am sure that the HDFS NM on 1002 was marked as active
[17:07:35] BUT
[17:07:37] elukey: I have a question related to that
[17:07:49] one thing that Andrew suggested/asked was whether the NM was in safe mode
[17:07:59] + the RM uses a back-off to retry on HDFS
[17:08:09] I think that the combination of both caused the issue
[17:08:42] elukey: indeed - HDFS in read-only mode makes yarn fail
[17:09:01] We knew that, but expected it wouldn't happen
[17:09:11] I wonder why NN went in safe mode
[17:09:21] it does every time it boots
[17:09:23] too many datanodes out?
[17:09:29] hm
[17:09:42] when it loads the last snapshot + the remaining edit log
[17:09:47] before that it is in safe mode
[17:09:55] but reads are allowed
[17:09:56] makes sense - I was unaware of the NN reboot though
[17:09:57] IIRC
[17:10:30] NN reboot?
[17:11:14] you told me NN goes in read-only mode when it boots - And I wondered why it happened now
[17:11:21] ahhhh
[17:11:23] sorry
[17:11:25] didn't get it
[17:11:58] :)
[17:13:31] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10JAllemandou) >>! In T144100#5036380, @Nuria wrote: >>I think the suggestion of escaping all white-space characters...
[17:15:18] 10Analytics, 10Datasets-General-or-Unknown, 10Security, 10good first bug: Pageview dumps incorrectly formatted, need to escape special characters - https://phabricator.wikimedia.org/T144100 (10Nuria) Ah, yes, @JAllemandou is totally right as always.
[17:16:25] joal:
[17:16:29] something is odd though
[17:16:29] yessir
[17:16:33] ?
[17:16:46] what I did was restarting the NM on an-master1001
[17:16:49] to failover on 1002
[17:17:17] so NM on 1001, as expected, is showing safe mode on/off
[17:17:37] ah snap but it affects also the other one at this point
[17:17:59] elukey: sorry, can we use full service names? NM for NodeManager, but then no safemode ... - Let's be explicit :)
[17:18:16] So restart of NodeManager on 1001
[17:18:18] right?
[17:18:21] sorry you are right, NM is not the right one, it is NameNode
[17:18:25] too many acronyms
[17:18:26] Ah ok :)
[17:18:29] yes indeed :)
[17:18:29] I'll use full ones
[17:18:32] Thanks ;)
[17:18:44] so on an-master1002
[17:18:44] 2019-03-19 15:41:20,640 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 235 secs
[17:18:47] 2019-03-19 15:41:20,641 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF
[17:18:51] err 1001
[17:18:59] this is the name node being restarted
[17:19:08] ok
[17:19:36] but from the logs, as I pasted above, the yarn resource manager on 1002 started to read the rm state at ~15:47
[17:19:36] As if no failover had happened is my feeling
[17:19:58] the safe mode took ~235s though
[17:20:15] so ~4 mins
[17:20:29] elukey: when shutting down the namenode on 1001, 1002 should have picked up active, no?
[17:20:36] therefore no 4 min downtime
[17:21:00] yes correct, assuming that the safe mode is only for the namenode restarted
[17:21:19] this is my doubt now
[17:21:51] I always had the assumption that only the restarted one would have needed to go through safe mode
[17:22:09] elukey: can we batcave?
[17:22:11] sure
[17:22:13] should be easier
[17:35:22] (03PS1) 10Addshore: +x bit for rollbackconfirmation/userprops.php [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497568
[17:35:35] (03CR) 10Addshore: [V: 03+2 C: 03+2] +x bit for rollbackconfirmation/userprops.php [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497568 (owner: 10Addshore)
[17:35:44] (03Merged) 10jenkins-bot: +x bit for rollbackconfirmation/userprops.php [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497568 (owner: 10Addshore)
[17:35:47] (03PS1) 10Addshore: +x bit for rollbackconfirmation/userprops.php [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497569
[17:36:34] (03CR) 10Addshore: [V: 03+2 C: 03+2] +x bit for rollbackconfirmation/userprops.php [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497569 (owner: 10Addshore)
[17:40:10] * elukey afk for ~10m
[17:44:09] (03PS1) 10Addshore: Include note in cron scripts about +x bit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497572
[17:44:28] (03CR) 10Addshore: [V: 03+2 C: 03+2] Include note in cron scripts about +x bit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497572 (owner: 10Addshore)
[17:44:36] (03PS1) 10Addshore: Include note in cron scripts about +x bit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497573
[17:44:40] (03CR) 10Addshore: [V: 03+2 C: 03+2] Include note in cron scripts about +x bit [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/497573 (owner: 10Addshore)
[17:51:36] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10Ottomata) Ya, was thinking it'd have to be exec, and then if we can/should use service_checker that'd be fine. @akosiaris do you think we shouldn't do this?
[18:19:22] joal: https://github.com/apache/hadoop/blob/branch-2.6.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java#L224
[18:19:26] for when you are back
[18:19:47] from what I can read in the code Yarn tries to pull each applicationId's directory, one at a time
[18:21:03] anyway, we have ~8400 appids in the testing cluster
[18:21:17] so I am going to try now to repro the mess that I made
[18:26:24] wasn't able to repro in the testing cluster, but there safe mode lasted ~35s
[18:28:52] also confirmed that the safe mode is only for one of the namenodes
[18:29:02] (the one being restarted)
[18:38:55] elukey: back!
[18:39:33] elukey: indeed, from the code the RM looks at folders one at a time
[18:39:41] There are 10k of them on the prod cluster
[18:42:18] ok I have a better idea about what has happened, I was probably too eager to act
[18:42:38] joal: bc?
[18:42:48] sure elukey
[18:42:49] OMW
[18:43:43] 10Analytics, 10Product-Analytics, 10Core Platform Team Kanban (Done with CPT), 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), 10Services (done): `rev_parent_id` and `rev_content_changed` are missing in event.mediawiki_revision_tags_change - https://phabricator.wikimedia.org/T218274 (10chelsyx) Thanks all!
[19:41:29] 10Analytics, 10Product-Analytics, 10Core Platform Team Kanban (Done with CPT), 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), 10Services (done): `rev_parent_id` and `rev_content_changed` are missing in event.mediawiki_revision_tags_change - https://phabricator.wikimedia.org/T218274 (10Nuria) Super thanks @...
[19:53:42] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10akosiaris) Well I have my reservations for sure. As I said we are talking about a service-checker run every `10s` (tunable, but it's a sensible default).
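[Editor's note] The "one directory at a time" pattern visible in the linked FileSystemRMStateStore code explains the linear scaling. A toy model (this is an illustration, not the actual Hadoop code; the per-RPC latency and the 2-RPCs-per-app figure are assumptions chosen to match the ~38 ms/app observed in prod):

```python
# Toy model of sequential RM state-store recovery: a couple of filesystem
# RPCs per application directory, so total time is O(n_apps), independent
# of total data size.
RPC_LATENCY_S = 0.019  # assumed per-RPC cost (2 RPCs/app ~= 38 ms/app observed)

def recovery_time(n_apps: int, rpcs_per_app: int = 2,
                  rpc_latency_s: float = RPC_LATENCY_S) -> float:
    """Estimated seconds to sequentially load the state store for n_apps."""
    return n_apps * rpcs_per_app * rpc_latency_s

# Prod cluster: 10012 recovered applications -> roughly the 6 minutes seen.
prod = recovery_time(10012)
# Testing cluster: ~8400 appids, yet it did not repro there, so the real
# per-RPC constant must be much smaller on that cluster (less NameNode
# load, no simultaneous failover, etc.).
test_cluster = recovery_time(8400)
```

The point of the model is only that with a fixed per-RPC cost, the app count, not the 137 MB of state, drives the recovery time.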
While tunable, the com...
[20:12:38] (03PS1) 10Joal: Correct mw user-history create event timestamp [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/497604 (https://phabricator.wikimedia.org/T218463)
[20:14:09] 10Analytics, 10EventBus: EventGate Helm chart should POST test event for readinessProbe - https://phabricator.wikimedia.org/T218680 (10mobrovac) >>! In T218680#5038157, @akosiaris wrote: > If anything, it might make more sense to create a specialized GET `/healthz` endpoint that does just produces (and deletes...
[20:20:28] 10Analytics: Update mediawiki-history subgraph-partitioner so that it uses [page/user]_id in addition to title/text - https://phabricator.wikimedia.org/T218130 (10JAllemandou)
[20:47:32] 10Analytics, 10Analytics-Data-Quality, 10Product-Analytics, 10Patch-For-Review: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history - https://phabricator.wikimedia.org/T218463 (10JAllemandou) Thanks Neil for having raised this. I have found 3 issue...
[20:47:54] ok - no time for sqoop patch today, will do hopefully tomorrow evening
[20:48:00] Gone for tonight - Bye team
[20:52:36] (03CR) 10Nuria: [C: 03+2] Event(Logging) schema loader [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/492399 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata)
[20:56:50] (03Merged) 10jenkins-bot: Event(Logging) schema loader [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/492399 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata)
[20:57:32] <3 nuria :)
[21:04:23] 10Analytics, 10Analytics-Cluster, 10ORES, 10Scoring-platform-team, 10artificial-intelligence: Package dictionaries better for ORES models - https://phabricator.wikimedia.org/T217343 (10Harej) p:05Triage→03Low
[21:04:48] 10Analytics, 10Analytics-Cluster, 10ORES, 10Scoring-platform-team, 10artificial-intelligence: Package dictionaries better for ORES models - https://phabricator.wikimedia.org/T217343 (10Harej) p:05Low→03High
[21:05:41] 10Analytics, 10Analytics-Cluster, 10ORES, 10Scoring-platform-team, 10artificial-intelligence: Package dictionaries better for ORES models - https://phabricator.wikimedia.org/T217343 (10Harej) p:05High→03Low
[21:13:01] ottomata: the woes of the singletons are OVER
[21:13:08] ottomata: juas!
[21:13:12] hahh
[21:25:21] (03CR) 10Nuria: Correct mw user-history create event timestamp (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/497604 (https://phabricator.wikimedia.org/T218463) (owner: 10Joal)
[21:28:15] 10Analytics, 10Dumps-Generation, 10ORES, 10Scoring-platform-team, and 3 others: [Epic] Make ORES scores for wikidata available as a dump - https://phabricator.wikimedia.org/T209611 (10Harej) p:05High→03Low
[21:29:30] 10Analytics, 10Dumps-Generation, 10ORES, 10Scoring-platform-team, and 3 others: [Epic] Make ORES scores for wikidata available as a dump - https://phabricator.wikimedia.org/T209611 (10Harej) Having scores available as a dump is a great idea but unfortunately I don't think it's a pressing priority. (If you...
[21:29:35] (03CR) 10Nuria: "I will let joseph merge, other than the comment around the implicit factory on patch 6 (that I know we are not going to change) looks good" (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/494831 (https://phabricator.wikimedia.org/T215442) (owner: 10Ottomata)
[21:51:48] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1004 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[21:52:04] hm i know elukey disabled that earlier ^ ?
[21:53:01] ottomata: mmm I have re-enabled it
[21:53:27] looking
[21:55:10] it seems very spiky but working
[21:55:37] mysql consumer died and restarted
[21:55:46] https://grafana.wikimedia.org/d/000000505/eventlogging?panelId=12&fullscreen&orgId=1&from=now-24h&to=now-5m
[21:55:52] yaml.scanner.ScannerError: while scanning for the next token
[21:55:52] found character '\t' that cannot start any token
[21:56:00] i think maybe the meta mw api failed for a minute?
[21:56:25] still failing
[21:57:10] ottomata: do you know why today there was that huge drop?
[21:57:18] huge drop?
[21:57:33] check my last link
[21:57:34] no?
[21:57:47] it was around my 4PM, was it when I stopped the daemons?
[21:58:04] yikes
[21:58:10] why did you stop the daemons?
[21:58:57] did they get restarted with some new code?
[21:59:04] ya i see processors failing too
[21:59:30] although other charts seem ok
[21:59:37] ottomata: db1107 went under network maintenance, I stopped only the mysql daemons
[21:59:46] and then re-enabled via puppet
[22:00:21] it looks like some html is being returned to the mysql code that is looking up the schema
[22:00:44] no new code there.
[22:01:30] I see the processors failing to validate some events, not failing completely, no?
[22:02:08] i saw one fail, maybe it was just a fluke
[22:04:37] first occurrence for m4-consumer was
[22:04:38] Mar 19 14:03:05 eventlog1002 eventlogging-consumer@mysql-m4-master-00[28281]: 2019-03-19 14:03:05,094 [28281] (MainThread) Log [WARNING] Exception caught Traceback (most recent call last):
[22:04:47] of the weird html thing
[22:09:43] i can reproduce, and i maybe can see a way to fix
[22:09:49] but i do not know why it would change all of a sudden
[22:09:54] OH
[22:10:09] maybe my recent change to the el extension API is causing this
[22:10:18] and it didn't happen til now because the process hadn't restarted
[22:10:20] and schemas were cached?
[22:10:40] the url it is using to get the schema is returning html instead of json...
[22:10:59] i didn't change that..... tho
[22:11:00] uhhh
[22:11:06] ok i can fix in el real quick I think.
[22:11:14] dunno the cause but i think i know the fix
[22:11:46] ottomata: the thing that doesn't add up for me is that the timing of my daemon stop was Mar 19 16:27:44
[22:12:02] but the issue seems to have happened at around 14 UTC (the drop + start of the weird logs)
[22:12:40] ok, confirmed my fix works, making code change and deploying
[22:12:45] oh hm dunno
[22:12:54] maybe something caused it to restart before that?
[22:12:55] not sure
[22:13:19] and this is only mysql right?
[22:13:41] if this doesn't work here
[22:13:48] i'd expect it not to work for anything
[22:13:57] elukey: eventlogging-consumer@mysql-m4-master-00
[22:13:59] oops
[22:14:01] https://gerrit.wikimedia.org/r/#/c/eventlogging/+/497648/
[22:14:26] i've applied a hotfix on el1002
[22:14:31] waiting for jenkins then will merge/deploy
[22:14:40] (and i have to run very soon for a pottery class! aieee!)
[22:14:44] i will bring my compy with me
[22:15:34] seems to be working from the logs!
[22:15:39] super weird
[22:15:43] maybe they changed something?
[22:15:51] on the API I mean
[22:17:07] anyway, looks like everything is under control
[22:17:08] :)
[22:17:11] ping me if needed!
[22:17:14] * elukey afk!
[22:19:06] i mean i did recently... but maybe the format response stuff was changed? i dunno if that is in EL extension or not
[22:19:14] thanks!
[22:19:18] ok too impatient, merging
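[Editor's note] The failure mode above (a schema URL suddenly returning HTML, which then surfaces as a confusing `yaml.scanner.ScannerError`) suggests the kind of defensive check such a hotfix typically adds. This is a hypothetical sketch only; the function and names here are not from the eventlogging codebase, and the real patch is whatever landed in gerrit 497648:

```python
import json

def parse_schema_response(body: str) -> dict:
    """Parse a schema-API response body, failing fast with a clear error
    when the endpoint returns HTML (e.g. an error page) instead of JSON.
    Hypothetical helper, not the actual eventlogging fix."""
    stripped = body.lstrip()
    if stripped.startswith("<"):
        # Without this guard, an HTML error/redirect page reaches the
        # YAML/JSON parser and produces an opaque scanner error instead
        # of pointing at the real problem (the API response).
        raise ValueError(
            "schema endpoint returned HTML, not JSON: %r" % stripped[:80])
    return json.loads(stripped)

# A well-formed JSON schema parses; an HTML page raises early and clearly.
schema = parse_schema_response('{"title": "TestSchema", "type": "object"}')
```

Checking the response shape (or the `Content-Type` header) before parsing makes the "API changed under us" case diagnosable from the error message alone.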