[01:20:57] 10Analytics, 10Analytics-Kanban, 10Cloud-Services, 10Developer-Advocacy (Oct-Dec 2019): Explore importing geoeditors_daily data (agrehgated edits per namespace per country per wiki) into druid - https://phabricator.wikimedia.org/T234281 (10Nuria)
[01:28:36] 10Analytics, 10Analytics-Kanban, 10Cloud-Services, 10Developer-Advocacy (Oct-Dec 2019): Explore importing geoeditors_daily data (aggregated edits per namespace per country per wiki) into druid - https://phabricator.wikimedia.org/T234281 (10Nuria)
[05:56:38] o/
[06:19:19] 10Analytics, 10User-Elukey: Port IRCRecentChanges to Kafka - https://phabricator.wikimedia.org/T232483 (10elukey) Yesterday I had a chat with @faidon about this project and this is what I gathered: * we currently run a patched ircd daemon on kraz (`role::mw_rc_irc` in site.pp) that serves `irc.wikimedia.org`...
[06:19:57] 10Analytics, 10User-Elukey: Architecture of recent changes on top of kafka. Produce Design Document. - https://phabricator.wikimedia.org/T234234 (10elukey)
[06:20:39] 10Analytics, 10User-Elukey: Port IRCRecentChanges to Kafka - https://phabricator.wikimedia.org/T232483 (10elukey) 05Open→03Stalled As a first step, we (Analytics) are going to create a quick design doc / one-pager about what the architecture should look like in T234234
[06:20:47] 10Analytics, 10Code-Stewardship-Reviews, 10Operations, 10Tools, 10Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (10elukey)
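For context on the port discussed in T232483: a minimal sketch of what consuming the recent-changes stream from Kafka could look like, instead of connecting to irc.wikimedia.org. The broker and topic names below are assumptions for illustration, not taken from the task:

```bash
# Hypothetical: tail recent changes straight from Kafka.
# Broker host and topic name are assumed, not confirmed in the task.
kafkacat -C \
  -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t eqiad.mediawiki.recentchange \
  -o end \
  | jq -r '.meta.domain + " " + .title'
```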
[07:19:53] 10Analytics, 10Analytics-Kanban: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey)
[07:21:04] 10Analytics, 10Analytics-Kanban: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey) @MoritzMuehlenhoff @ArielGlenn Thoughts?
[07:21:19] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey)
[07:23:12] 10Analytics, 10Fundraising-Backlog, 10Operations, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10MoritzMuehlenhoff) 05Resolved→03Open There are two issues with the patch merged for Erin Yener: (1) I...
[07:51:55] Hi team
[07:55:05] o/
[07:58:38] elukey: I'm being kicked off the chan almost every day recently - any idea?
[08:01:20] joal: nope, what error do you get in your client?
[08:01:44] No error - I just get kicked off the chan and sent to wikimedia-overflow
[08:09:26] strange
[08:09:29] what client do you use
[08:09:30] ?
[08:09:36] irssi
[08:09:50] same thing for me, but I am not kicked out
[08:10:01] mwarf :(
[08:10:19] maybe there is some setting that could prevent this? I set mine up a long time ago and I don't recall if I had to do anything
[08:10:28] but I can share the config if you want
[08:11:06] elukey: when I disconnect, do you see anything strange on the chan, like a split or something?
[08:15:26] joal: I don't have all the events logged, only when people join, but I can add all of them so we can monitor
[08:15:42] no bother elukey - I'll look at logs
[08:15:58] no bother at all :)
[08:17:27] Nothing logged :(
[08:19:02] I can see you logging out and back in elukey
[08:19:33] joal: yep just restarted irssi, I had some pending cleanups to do, also removed the ignores
[08:19:48] next time that happens I'll tell you what I see
[08:20:08] Thanks
[08:21:41] elukey - Do you think me asking ottomata to set up alluxio for a test with presto is a good idea?
[08:32:03] joal: I have no idea what alluxio is but +1
[08:32:14] :D
[08:32:50] ah nice https://www.alluxio.io/
[08:33:54] no idea how it would need to be deployed, is it a daemon?
[08:34:59] elukey: here is my thought process: one of the main advantages of big-data computation is data-locality optimization. Presto as it is set up loses this advantage (not part of HDFS) - IMO alluxio could be an interesting helper :)
[08:35:14] elukey: I think it's a system with master/workers
[08:35:24] elukey: like another layer on top of hdfs
[08:36:20] ah so it would need to be deployed on hadoop
[08:36:21] right?
[08:36:26] (I mean on hadoop workers)
[08:37:11] elukey: I don't think so - deployed on the machines where the cache happens (presto in our case), and configured to cache data flowing in from HDFS
[08:37:57] ah ok, clearer now
[08:38:15] I don't see why not! Maybe let's wait to see the first results from andrew
[08:38:19] elukey: My idea is that caching data at the presto-machine level could help
[08:38:27] makes sense yes
[08:38:46] does presto need zookeeper?
[08:39:00] IIRC it does
[08:39:37] that brings me to the next question!
[08:39:56] an-conf100[1-3] are ready, they have been serving the hadoop test cluster for the past few days
[08:39:57] Moar questionz, moar answerz (maybe)
[08:40:15] there are some caveats though
[08:40:24] the most important one is that they run Java 11
[08:40:35] I tried java 8, but it doesn't work with the zookeeper version on buster
[08:40:51] but, no issues so far from the hdfs zkfc logs etc..
[08:41:18] (like serialization troubles, etc..)
[08:41:34] so I'd be inclined to move the hadoop prod cluster to the new zookeeper nodes
[08:41:41] and presto/alluxio/etc..
[08:42:29] works for me elukey - Getting out of the prod-zookeeper is something we've been waiting for IIRC
[08:44:03] joal: the main "issue" is that we'll have to stop hdfs/yarn briefly on the master nodes
[08:44:17] otherwise we might end up in a split-brain scenario
[08:44:51] so something like - 1) cluster drain 2) hdfs safe mode 3) shutdown of the master daemons 4) merge/run-puppet with the new config
[08:44:55] yes - well, there are changes that need the machines to stop :)
[08:45:41] when would it be ok for you?
[08:45:46] (to schedule the change)
[08:46:41] whenever elukey - As usual I'll only be here watching you ;)
[08:47:00] ack, will prep the etherpad and possibly schedule it for tomorrow
[08:47:20] elukey: arf - tomorrow kids (sorry, should have mentioned)
[08:47:29] thursday?
[08:48:16] yessir
[08:50:28] ah yes right Thu is fine!
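A rough sketch of steps 2-4 from the sequence above, on the master node. The service unit names are assumed (stock CDH-style) and run-puppet-agent is the usual WMF wrapper; the real runbook was drafted separately in an etherpad, so treat this as illustrative only:

```bash
# 2) put HDFS in safe mode so no writes happen during the cutover
sudo -u hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs hdfs dfsadmin -safemode get    # expect: "Safe mode is ON"

# 3) stop the master daemons (unit names assumed, CDH-style)
sudo systemctl stop hadoop-yarn-resourcemanager hadoop-hdfs-namenode

# 4) merge the puppet change pointing at the new an-conf100[1-3] zookeepers,
#    apply it, then bring the daemons back and leave safe mode
sudo run-puppet-agent
sudo systemctl start hadoop-hdfs-namenode hadoop-yarn-resourcemanager
sudo -u hdfs hdfs dfsadmin -safemode leave
```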
[08:55:11] 10Analytics, 10Analytics-Kanban, 10Cloud-Services, 10Developer-Advocacy (Oct-Dec 2019): Develop a tool or integrate feature in existing one to visualize WMCS edits data - https://phabricator.wikimedia.org/T226663 (10JAllemandou) Hi @srishakatux - I'm sorry for not answering sooner, I've been sick the whole...
[09:04:41] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts - https://phabricator.wikimedia.org/T234229 (10ArielGlenn) Adding @Bstorm because the labstore servers are WMCS boxes. How long does data take to show up on the lab...
[09:08:51] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing the dump hosts - https://phabricator.wikimedia.org/T234229 (10elukey) The main problem is that a ton of data (like the recent mediawiki history dumps) needs to go through the fuse h...
[09:41:47] (03PS3) 10Joal: Add network-origin to the geoeditors-daily table [analytics/refinery] - 10https://gerrit.wikimedia.org/r/538613 (https://phabricator.wikimedia.org/T233504)
[09:58:15] 10Quarry, 10User-revi, 10cloud-services-team (Kanban): Quarry struggling on the queue (2019-10-01) - https://phabricator.wikimedia.org/T234310 (10revi)
[09:58:58] elukey: found this today - https://blog.ippon.fr/content/images/2018/03/6ced44f31377835938ccb66275194e2b3ccea500967210c35ca8fb2343cbaf8d.jpg
[10:01:19] hahaahhaahahahh
[10:01:37] made my day
[10:04:37] kerberos replication between eqiad/codfw works
[10:04:40] what a nightmare
[10:04:50] you, man, should be named Hercules
[10:06:36] <#
[10:06:38] <3
[10:07:41] it is frustrating since the error messages are so brief and not explanatory
[10:08:00] need to write some docs now
[10:08:07] otherwise I'll forget the magic sequence
[10:13:16] all kerberos error messages are cryptic, it's one of the biggest problems krb5 has
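For the record, the "magic sequence" for MIT Kerberos master-to-replica propagation generally looks like the sketch below. The hostnames come from the chat (krb1001/krb2001); the dump file path and the kpropd service name are MIT/Debian conventions assumed here, not confirmed for these hosts:

```bash
# On the master KDC (krb1001): dump the principal database and push it.
# The dump file path is the conventional MIT one, assumed for illustration.
sudo kdb5_util dump /var/lib/krb5kdc/replica_datatrans
sudo kprop -f /var/lib/krb5kdc/replica_datatrans krb2001.codfw.wmnet

# On the replica (krb2001): kpropd must be listening to accept the dump
# (service name assumed from the Debian krb5 packaging).
sudo systemctl status krb5-kpropd
```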
[10:29:41] * elukey lunch!
[11:34:18] I don't know if the mysql issue is related, but I'm experiencing problems on stat1007 - I get an HDFS command error at regular intervals (more or less)
[11:56:37] 10Quarry, 10User-revi, 10cloud-services-team (Kanban): Quarry struggling on the queue (2019-10-01) - https://phabricator.wikimedia.org/T234310 (10zhuyifei1999) 05Open→03Resolved a:03zhuyifei1999 I ssh-ed in to both workers, no weird behavior. No logs either. It's as if the celery workers never received...
[12:07:44] joal: I think it is related, the rsync is causing 1) a ton of resources consumed by the fuse mount 2) a ton of sockets opened
[12:07:59] mwarf
[12:08:36] elukey: indeed the error is networking-related (Namenode for analytics-hadoop remains unresolved for ID an-master1001-eqiad-wmnet)
[12:12:35] 10Analytics, 10Fundraising-Backlog, 10Operations, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10EYener) Thank you! I also have access to Turnilo. I have two follow-up questions: 1. Can @jkumalah and...
[12:30:43] 10Analytics, 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic, and 2 others: No access to mysql from stat1007 - https://phabricator.wikimedia.org/T234160 (10elukey) We are almost sure that this issue is due to network/system overload caused by a big rsync that is currently running...
[12:47:59] 10Analytics, 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic, and 2 others: No access to mysql from stat1007 - https://phabricator.wikimedia.org/T234160 (10GoranSMilovanovic) @elukey Thank you. I was able to collect the data needed for T234036 from `stat1004`, so I will close thi...
[12:48:08] 10Analytics, 10Analytics-Kanban, 10WMDE-Analytics-Engineering, 10User-GoranSMilovanovic, and 2 others: No access to mysql from stat1007 - https://phabricator.wikimedia.org/T234160 (10GoranSMilovanovic) 05Open→03Resolved
[12:53:13] (03CR) 10Joal: "This patch needs to be rebased on top of its updated parent." (036 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: 10Fdans)
[13:44:38] joal o/ yt?
[13:44:49] hey ottomata - Here I am yes :)
[13:45:09] how may I help?
[13:45:53] i want to backfill rev score
[13:46:01] ok
[13:46:07] so i'm trying to construct a query to transform the array fields into our new map fields
[13:46:33] i've moved the old table and partitions to otto.mediawiki_revision_score_1
[13:47:04] am reading stuff, trying to figure out how...but thought you might have some tips :)
[13:47:39] ottomata: hm - I've not done that I think
[13:48:20] ottomata: I'd probably use RDDs :)
[13:49:08] oh that would work...right...
[13:49:08] hmmm
[13:50:10] ottomata: I have the problem you encountered with camus as well
[13:50:58] ottomata: a QUESTION sorry :)
[13:51:44] joal: ya?
[13:52:13] ottomata: Could it have been an issue with a wrong consumer-group value?
[13:53:10] joal: i don't think so...camus doesn't use consumer groups :)
[13:53:19] the offsets are in hdfs
[13:53:40] and the offset history files existed and were being read by camus
[13:53:53] and i'm pretty sure data was coming back from kafka at the requested offset
[13:56:26] ottomata: I was trying to understand the empty iterator thing - Ah right - no data stored in kafka-internals in camus...
[13:56:31] ok
[13:57:05] ottomata: if you'd have a minute I could use a quick chat on spark memory - I'm facing wonders
[13:57:30] ottomata: o/
[13:57:42] does presto use zookeeper by any chance? Didn't see any setting today
[13:57:57] but I was wondering that.. if so, it would be great to make it use the an-conf100X hosts
[13:58:14] I was chatting today with Joseph about migrating Hadoop Prod to the new zk nodes on thursday
[14:01:10] oh ottomata - Another topic: alluxio for presto !
[14:02:30] there are days like that - I could possibly call them the otto-days :)
[14:05:22] I am writing https://wikitech.wikimedia.org/wiki/User:Elukey/Analytics/Kerberos to gather all the info about the kerberos cluster
[14:05:30] still wip but should contain it all
[14:22:29] joal: sorry for some reason my irc client isn't pinging me...
[14:22:37] alluxio ...?
[14:22:38] looking
[14:24:25] joal not sure I understand?
[14:24:29] elukey: nice!
[14:24:52] ottomata: currently all the data needs to be transferred to presto every time it is queried
[14:25:00] 10Analytics, 10Analytics-Kanban, 10Cloud-Services, 10Developer-Advocacy (Oct-Dec 2019): Develop a tool or integrate feature in existing one to visualize WMCS edits data - https://phabricator.wikimedia.org/T226663 (10Milimetric) @srishakatux / @bd808: what would you like to see in your dashboard? Off the t...
[14:25:35] ottomata: what would the schema registry UI do?
[14:25:48] ottomata: With alluxio, if some data is queried over and over, it'll be cached locally on the presto nodes instead of being hdfs-copied to them at every query
[14:27:34] oh
[14:29:15] 10Analytics: Import siteinfo dumps onto HDFS - https://phabricator.wikimedia.org/T234333 (10JAllemandou)
[14:29:36] (03PS1) 10Joal: Add site-info dump type to importer [analytics/refinery] - 10https://gerrit.wikimedia.org/r/540124 (https://phabricator.wikimedia.org/T234333)
[14:30:03] ottomata: would you have a minute to batcave on spark memory?
[14:30:34] milimetric: i was mostly searching for stuff to satisfy use cases like "
[14:30:34] As an analyst/product manager I want to be able to search through existing schemas to find which data is being collected and how the data is defined in the event system.
[14:30:34] "
[14:30:36] joal sure!
[14:30:59] actually joal gimme 4 mins
[14:31:03] np
[14:31:38] ottomata: I thought Apache Atlas was giving us that
[14:32:05] joal, ottomata - this is a draft of the procedure to swap the zk nodes https://etherpad.wikimedia.org/p/analytics-zk-migration
[14:32:13] let me know your thoughts when you have a minute
[14:32:23] should be simple enough, but better to triple check
[14:33:36] elukey: the only downside I can think of is the loss of the historical jobs for the time - but seems small
[14:33:57] milimetric: i dunno maybe...?
[14:34:05] yes that is the downside, unless we dump the status of zookeeper etc..
[14:34:10] schema / config / governance is a long way off and not prioritized i think
[14:34:12] anyway
[14:34:12] too much work
[14:34:19] i was just browsing around and wanted to note that thing
[14:34:36] the thing is great, we should deploy it and get over everyone's fear of .NET and be happy
[14:34:58] MS basically just threw $$ at developer sadness, it's great
[14:35:20] (03CR) 10Milimetric: Transition data rows to using time ranges instead of timestamps (031 comment) [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/531148 (https://phabricator.wikimedia.org/T230514) (owner: 10Fdans)
[14:35:34] (03CR) 10Milimetric: "check experimental" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/531148 (https://phabricator.wikimedia.org/T230514) (owner: 10Fdans)
[14:36:00] (03CR) 10Milimetric: "heh, maybe it doesn't work like that anymore? :)" [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/531148 (https://phabricator.wikimedia.org/T230514) (owner: 10Fdans)
[15:04:02] \o/ phab inbox 0
[15:04:16] heya elukey yt?
[15:05:03] i think we accidentally skipped adding an-worker1088 to net_topology! :)
[15:05:04] https://gerrit.wikimedia.org/r/c/operations/puppet/+/474904/3/hieradata/common.yaml
[15:05:14] unless we did it on purpose
[15:05:15] do you know?
[15:06:00] ottomata: nono surely not on purpose
[15:06:07] ok going to add it
[15:06:15] won't bother restarting but just in case
[15:06:18] but is it in the default rack/whatever?
[15:06:20] just so we have it there next time
[15:06:21] yes
[15:06:31] that is bad, because we have an alarm
[15:06:39] so it doesn't work
[15:06:43] we do?
[15:06:47] can you wait a sec before fixing it?
[15:06:49] ya
[15:06:53] i'll just make the patch
[15:08:53] ah snap!
[15:08:54] Rack: /eqiad/default/rack
[15:08:54] 10.64.36.100:50010 (an-worker1088.eqiad.wmnet)
[15:09:02] we have /usr/local/lib/nagios/plugins/check_hdfs_topology
[15:09:14] but it does sudo -u hdfs hdfs dfsadmin -printTopology | grep -q 'Rack: default'
[15:09:32] this is why it doesn't work
[15:09:50] we could simply grep for default?
[15:10:03] grep -iq 'default'
[15:10:37] or maybe something more elaborate like 'Rack.*default.*'
[15:11:17] sudo -u hdfs hdfs dfsadmin -printTopology | egrep 'Rack:.*default.*'
[15:12:16] sending a code patch
[15:14:26] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/540150/
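Putting the pieces together, the fix discussed above presumably boils down to something like this sketch of the check (Nagios-style exit codes; the exact plugin wording in the patch is assumed):

```bash
#!/bin/bash
# Sketch of the fixed check_hdfs_topology: alert if any DataNode is still
# registered under a default rack, whatever the full default-rack path is.
if sudo -u hdfs hdfs dfsadmin -printTopology | egrep -q 'Rack:.*default.*'; then
    echo "CRITICAL: datanode(s) found in a default rack"
    exit 2
fi
echo "OK: all datanodes have an explicit rack assigned"
exit 0
```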
[15:19:43] we can deploy your patch when we switch the zk clusters?
[15:25:29] ah!
[15:25:57] 10Analytics, 10Analytics-Kanban, 10User-Elukey: Make the Kerberos infrastructure production ready - https://phabricator.wikimedia.org/T226089 (10elukey) Summary of progress: * set up krb1001 in eqiad and krb2001 in codfw * set up basic replication between 1001 and 2001 via kprop/kpropd * documented ever...
[15:44:09] elukey: we can merge my patch anytime
[15:44:21] and wait for whatever next nm/rm restarts we do
[15:44:31] ....and nodemanagers?
[15:44:41] i think maybe just the nm and rm use net topology?
[15:45:49] +1
[15:58:41] (03PS1) 10Mforns: Migrate reports from MySQL EventLogging to Hive [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/540159 (https://phabricator.wikimedia.org/T223414)
[16:14:07] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Move the Analytics Refinery to Python 3 - https://phabricator.wikimedia.org/T204735 (10elukey) Another weird use case happened: after running `apt-get autoremove` on an-worker1080, python2 dependencies got cleaned up but then druid_loader.py caused some fai...
[16:19:17] (03PS2) 10Mforns: Migrate reports from MySQL EventLogging to Hive [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/540159 (https://phabricator.wikimedia.org/T223414)
[16:30:33] (03CR) 10Milimetric: [C: 03+1] "I just did a sanity check on the queries, they look fine to me. I wouldn't worry too much since you can just rerun in case of a problem." [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/540159 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns)
[16:43:02] a-team: Asheville, NC too! That's a great option
[16:43:06] best ice cream I've ever had
[16:51:40] * elukey off!
[16:51:41] o/
[17:06:30] 10Analytics, 10MinervaNeue, 10Readers-Web-Backlog: MinervaClientError sends malformed events - https://phabricator.wikimedia.org/T234344 (10Jdlrobson)
[18:57:52] joal: don't suppose you are around?
[19:10:27] 10Analytics, 10Fundraising-Backlog, 10Operations, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10Nuria) Is @herron going to correct the patch per @MoritzMuehlenhoff's guidelines? @EYener: You should ha...
[19:50:25] hey ottomata - passing by now - how may I help?
[19:54:53] 10Analytics, 10Fundraising-Backlog, 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) >>! In T233636#5536946, @MoritzMuehlenhoff wrote: > There are two issues wi...
[19:56:57] hm - gone again - we'll see tomorrow
[20:09:49] oh hey
[20:09:50] sorry!
[20:09:53] joal ok tomorrow is fine
[20:10:04] no worries, i'm deep in spark land
[20:10:09] maybe getting somewhere...not sure
[20:16:39] hey a-team: is there a process around de-whitelisting a schema? the Growth team is looking to stop storing Help Panel data. Didn't see this mentioned on wikitech, but maybe I was looking in the wrong place?
[20:26:41] Nettrom: i don't think anyone has ever asked to do that! :)
[20:27:04] I think the process would be the same as whitelisting tho, but with less review required?
[20:27:25] ottomata: I was suspecting that our request would be somewhat rare, yes ;)
[20:27:26] although, i don't think anything will automatically delete the old data
[20:27:33] it'll just start sanitizing the new stuff
[20:27:38] we'd have to manually sanitize the old stuff...i think
[20:29:08] what I'm thinking the process might look like is: 1) I create a phab task; 2) submit a patch to remove the schema from the whitelist; 3) back up relevant data we need to keep (part of a different experiment); 4) after the patch is deployed, ask someone to delete the table
[20:29:13] would that sound about right?
[20:29:58] sounds right to me!
[20:31:54] cool! I'll make sure a phab task is opened once we're ready to start moving
[20:31:58] and thanks!
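A hedged sketch of what steps 3 and 4 of that process could look like on the cluster side; every name below (backup database, table, partition filter) is hypothetical, since the actual cleanup task had not been filed yet:

```bash
# 3) back up the rows that need to be kept (all names here are hypothetical)
hive -e "CREATE TABLE backups.help_panel_experiment AS
         SELECT * FROM event.helppanel WHERE year = 2019;"
hdfs dfs -du -s -h /user/hive/warehouse/backups.db/help_panel_experiment  # sanity check

# 4) once the whitelist patch is deployed, an admin drops the old table
hive -e "DROP TABLE event.helppanel;"
```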
[20:47:22] 10Analytics, 10Fundraising-Backlog, 10Fundraising Sprint Sysadmin Kane, 10Fundraising Sprint T 2019: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Ejegg)
[22:00:14] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Scoring-platform-team, 10Patch-For-Review: Change event.mediawiki_revision_score schema to use map types - https://phabricator.wikimedia.org/T225211 (10Ottomata) The Hive table is looking good and new data is coming in. I moved the old data away and created...
[22:03:55] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 599 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[22:06:51] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[22:17:51] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 570 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[22:33:14] (03CR) 10Nuria: [C: 03+1] "Looks good, if we have tested the job +2" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/538613 (https://phabricator.wikimedia.org/T233504) (owner: 10Joal)
[22:43:45] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:04:47] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 787 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:09:37] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:18:18] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 610 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:23:36] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:33:38] PROBLEM - HDFS Namenode RPC 8020 call queue length on an-master1001 is CRITICAL: 620 ge 20 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:37:25] (03CR) 10Nuria: Migrate reports from MySQL EventLogging to Hive (032 comments) [analytics/reportupdater-queries] - 10https://gerrit.wikimedia.org/r/540159 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns)
[23:38:00] RECOVERY - HDFS Namenode RPC 8020 call queue length on an-master1001 is OK: (C)20 ge (W)10 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration https://grafana.wikimedia.org/dashboard/db/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=54&fullscreen
[23:39:36] (03CR) 10Nuria: Add oozie job to load top mediarequests data (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: 10Fdans)
[23:40:40] (03CR) 10Nuria: [C: 03+1] "Change looks fine" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/540124 (https://phabricator.wikimedia.org/T234333) (owner: 10Joal)