[01:21:07] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4159961 (10Milimetric) [01:21:57] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4102098 (10Milimetric) Ok, @Nuria: I've incorporated some hard-fought analysis about the percent differences here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geowiki#Percent_Difference [01:22:48] note to self: jump out of a second story window next time someone asks me to do any kind of statistics. The broken bones are sure to be more pleasant [01:23:23] note to joal: I admire and envy your pretty graphs and what must be amazing spreadsheet skills. [10:54:02] 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4160779 (10mobrovac) [11:19:01] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160899 (10fdans) OK, so to determine the periodicity of the cron job, I ran a city query over ~17,000 IP addresses with: - The most current GeoIP d... [11:20:51] hellloooo milimetric: can you take a look at this? ^ [11:22:04] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160932 (10faidon) As far as periodicity goes, note that MaxMind [[ https://support.maxmind.com/geoip-faq/geoip2-and-geoip-legacy-databases/how-often-a... [12:42:47] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4161078 (10Milimetric) thanks @faidon, we were just seeing if maybe the accuracy of the old databases is really high, we can schedule the jobs less oft... 
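The periodicity vetting fdans describes above (geolocating the same IP sample against the current and an older MaxMind database, then measuring how much the results drift) can be sketched in Python. This is a hypothetical illustration, not the actual query that was run; `lookups_current` and `lookups_old` stand in for ip → city results from the two database versions.

```python
# Hypothetical sketch: measure drift between two GeoIP database versions
# by comparing city lookups for the same IP sample.

def geo_drift(lookups_current, lookups_old):
    """Return the fraction of IPs whose city result changed between
    the two database versions (both dicts map ip -> city name)."""
    common = set(lookups_current) & set(lookups_old)
    if not common:
        return 0.0
    changed = sum(1 for ip in common
                  if lookups_current[ip] != lookups_old[ip])
    return changed / len(common)

# Made-up data: one of four IPs moved city between database versions.
current = {"ip1": "Madrid", "ip2": "Paris", "ip3": "Rome", "ip4": "Oslo"}
old     = {"ip1": "Madrid", "ip2": "Paris", "ip3": "Rome", "ip4": "Bergen"}
print(geo_drift(current, old))  # 0.25
```

A drift curve over databases of increasing age is what motivates the weekly-snapshot decision discussed below.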
[12:42:51] fdans: keeping it weekly is the only choice there, commented [12:43:22] fdans: good thing you ran that! It's so bad after even a month!! [13:01:06] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (10JAllemandou) +1 for weekly on wednesday. Thanks @fdans and @faidon :) [13:09:04] 10Analytics, 10Discovery-Analysis, 10Product-Analytics: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#4161142 (10Ottomata) Anyone can kill YARN jobs that they own: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application `yarn kill -... [13:40:44] 10Analytics, 10Discovery-Analysis, 10Product-Analytics: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#4161203 (10GoranSMilovanovic) @Ottomata Thank you - I didn't know about this one. I will need to check further why this happens with {sparklyr}. [13:45:12] hmmm... I accidentally pushed directly to refinery-source master branch [13:45:41] instead of HEAD:refs/for/master [13:45:59] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161212 (10Ottomata) [13:52:17] (03PS1) 10Mforns: Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 [13:59:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161238 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1057.eqiad.wmnet'] ``` T... [13:59:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. 
- https://phabricator.wikimedia.org/T192557#4161239 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1056.eqiad.wmnet'] ``` T... [14:07:41] (03CR) 10Ottomata: [C: 031] Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 (owner: 10Mforns) [14:16:08] ottomata: hm, I guess the script doesn't need to be a template anymore then :) [14:19:55] yup! [14:23:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161288 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1056.eqiad.wmnet'] ``` and were **ALL** successful. [14:25:07] 10Analytics-Kanban, 10Patch-For-Review: Add defaults section to WhitelistSanitization.scala - https://phabricator.wikimedia.org/T190202#4161297 (10Milimetric) p:05Triage>03Normal [14:25:23] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Make mediawiki-history-reduced table permanent (snapshot partitioning) - https://phabricator.wikimedia.org/T192482#4161298 (10Milimetric) p:05Triage>03Normal [14:32:01] 10Analytics-Kanban, 10Patch-For-Review: Checklist for geowiki pipeline - https://phabricator.wikimedia.org/T190409#4161307 (10Milimetric) [14:35:42] 10Analytics-Kanban: Clean up wmf_raw.mediawiki_private_cu_changes and wmf.geowiki_daily - https://phabricator.wikimedia.org/T193165#4161310 (10Milimetric) p:05Triage>03Normal [14:36:22] fdans: wait what did we decide to do with wikistats deployment? [14:37:06] milimetric: keep doing the same thing but avoid the glitch [14:37:39] how do we avoid the glitch? 
[14:47:44] 10Analytics: Add templating support to reportupdater scripts - https://phabricator.wikimedia.org/T163252#3191334 (10mforns) This is already supported by RU, see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater#The_reports_section The template parameters are passed to the script in alphabetica... [14:50:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161360 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1057.eqiad.wmnet'] ``` and were **ALL** successful. [14:50:51] fdans, mforns : i have fixed the tests for ua parser but more changes are needed, will continue working on this tomorrow [14:51:07] nuria_, saw that in the email [14:51:21] k [14:51:45] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:51:49] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161375 (10mforns) [14:52:19] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:52:20] 10Analytics: Investigate adding user-friendly testing functionality to Reportupdater - https://phabricator.wikimedia.org/T156523#4161376 (10mforns) [14:52:41] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:52:44] 10Analytics, 10Easy: Reportupdater: do not write execution control files in source directories - https://phabricator.wikimedia.org/T173604#4161378 (10mforns) [14:53:53] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161381 (10mforns) [14:53:55] 10Analytics: Reportupdater writes a README in the output folder - https://phabricator.wikimedia.org/T163134#4161380 (10mforns) [14:54:10] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:54:12] 10Analytics: Add templating support to reportupdater scripts - 
https://phabricator.wikimedia.org/T163252#4161382 (10mforns) [14:54:58] 10Analytics: Add templating support to reportupdater scripts - https://phabricator.wikimedia.org/T163252#3191334 (10mforns) Added this task as a subtask of T193167 for the record, but I think this can be closed as resolved or invalid. [14:56:13] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161389 (10mforns) [14:56:21] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161399 (10mforns) [14:56:24] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161398 (10mforns) [15:02:42] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161424 (10mforns) [15:02:49] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161436 (10mforns) [15:02:51] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161435 (10mforns) [15:10:32] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161463 (10mforns) [15:10:53] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161475 (10mforns) [15:10:55] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161474 (10mforns) [15:13:18] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:13:52] hm [15:16:19] ah it works! [15:16:24] stat1005 died [15:16:30] :) [15:17:01] what? 
I can ssh into it [15:17:05] oh before [15:17:23] yeah, and /mnt/hdfs is not usable [15:17:42] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161508 (10mforns) [15:17:47] (got the message and logged in to check :) [15:17:48] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161519 (10mforns) [15:17:50] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161518 (10mforns) [15:18:31] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161508 (10mforns) [15:19:03] i umounted and mounted it [15:19:06] (my internet keeps dying!) [15:19:06] super [15:19:14] the alert works though yay! [15:22:16] going afk again to finish the move [15:22:34] (ETOOMANYBOXESTODAY) [15:24:08] laterrrs [15:26:09] 10Analytics, 10Analytics-Kanban: Productionize EventLoggingSanitization - https://phabricator.wikimedia.org/T193176#4161574 (10mforns) [15:27:41] 10Analytics, 10Analytics-Kanban: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176#4161574 (10mforns) [15:43:18] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is OK: OK [15:44:37] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161630 (10Ottomata) [15:50:36] ottomata, I'm not sure where the cron job for EventLoggingSanitization.scala needs to go. There's this refine.pp... the ELSanitization could be a refine job, but it needs to parse and validate the whitelist, so it has an extra previous step, that makes it not suited for refine.pp no? The other option I saw is data_drop.pp. What do you think? 
[15:55:33] hm [15:55:42] mforns: i'd be fine with either, maybe we can rename data_drop.pp [15:56:24] refinery::job::purge [15:56:24] ? [15:57:16] or mforns we could add an $extra_ops param to refine_job [15:57:17] ? [15:57:24] hm, naw it is a different main class [15:57:28] naw let's just make it a one-off cron job [15:57:54] a-team i’ll be a couple min late to standup [15:59:17] ottomata, ok [16:02:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161741 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1054.eqiad.wmnet'] ``` T... [16:02:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161742 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1055.eqiad.wmnet'] ``` T... 
[16:04:43] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4161768 (10mobrovac) [16:27:13] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959#4161862 (10fdans) p:05Triage>03Normal [16:27:20] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959#4155423 (10fdans) p:05Normal>03Low [16:29:23] 10Analytics: Under construction page in wikistats to take site down - https://phabricator.wikimedia.org/T192847#4161880 (10fdans) p:05Triage>03High [16:30:11] 10Analytics, 10Analytics-Kanban: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176#4161884 (10fdans) p:05Triage>03High [16:31:47] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161887 (10fdans) p:05Triage>03Unbreak! [16:31:55] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161891 (10fdans) p:05Unbreak!>03Triage [16:32:29] 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4161892 (10fdans) p:05Triage>03Normal [16:32:46] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161896 (10mforns) p:05Triage>03Unbreak! [16:32:48] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161900 (10mforns) p:05Triage>03Unbreak! 
[16:32:59] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161902 (10mforns) p:05Triage>03Unbreak! [16:33:01] 10Analytics: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169#4161906 (10mforns) p:05Triage>03Unbreak! [16:33:04] 10Analytics, 10Commons, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4161910 (10fdans) p:05Normal>03Triage [16:33:09] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161913 (10mforns) p:05Triage>03Unbreak! [16:33:22] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161919 (10mforns) p:05Triage>03Unbreak! [16:33:24] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161926 (10mforns) p:05Unbreak!>03Triage [16:33:26] 10Analytics, 10Collaboration-Team-Triage, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make EchoNotification job JSON-serialiable - https://phabricator.wikimedia.org/T192945#4161923 (10fdans) p:05Normal>03Triage [16:33:38] 10Analytics: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169#4161927 (10mforns) p:05Unbreak!>03Triage [16:34:06] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161929 (10mforns) p:05Unbreak!>03Triage [16:34:15] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161930 (10mforns) p:05Unbreak!>03Triage [16:34:25] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161931 (10mforns) p:05Unbreak!>03Triage [16:34:59] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161933 (10mforns) p:05Unbreak!>03Triage [16:47:01] joal: 
https://github.com/wikimedia/analytics-refinery/blob/master/hive/webrequest/check_dataloss_false_positives.hql [16:47:06] joal: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql [17:32:13] 10Analytics, 10Cassandra, 10Maps-Sprint, 10Operations, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Eevans) The restbase cluster has been upgraded package-wise, but a rolling restart still needs to be scheduled. [17:34:20] 10Analytics, 10Cassandra, 10Maps-Sprint, 10Operations, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10MoritzMuehlenhoff) >>! In T192948#4162125, @Eevans wrote: > The restbase cluster has been upgraded package-wise, but a rolling... [17:38:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4162150 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1055.eqiad.wmnet'] ``` and were **ALL** successful. [17:38:41] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4162151 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1054.eqiad.wmnet'] ``` and were **ALL** successful. [17:40:08] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:41:00] whaaaa [17:41:09] whaaa [17:41:10] looking [17:42:14] elukey: i see no reason why it shut down [17:42:18] just that it did? [17:42:40] 1002 has taken over as active [17:43:01] starting it back up.. [17:43:14] java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. [17:43:19] oh [17:43:20] zk? 
[17:43:23] journalnodes? [17:43:31] it might be either of the two [17:43:55] 2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.36.128:8485, 10.64.53.14 [17:43:59] :8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961)) [17:45:52] hm [17:46:29] i didn't reboot any journalnodes [17:47:32] so in theory when two out of three jn are not available the namenode shuts down [17:47:46] ? [17:47:49] oh two out of three [17:47:49] ya [17:48:00] they seem fine though [17:49:17] yeah I don't find anything [17:50:05] ottomata: as a follow-up, should we change the pager to analytics only? [17:54:24] I can see a ton of INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_1287766178_214099831 to [17:54:27] right before the issue [17:57:06] it is weird, they started at 11 UTC [17:57:19] wow elukey my computer just hard crashed [17:57:19] sorry [17:57:32] no problem :) [17:57:45] elukey, ottomata: spike in processes and load on an1028, an1035 and an1052 at 17:37 UTC - Could be related? [17:58:02] sounds related [17:58:03] all at once [17:58:04] ? [17:58:06] yup [17:58:10] on the 3 nodes [17:58:16] were there like a huge number of edits to make or something? [17:58:23] lots of files being created/deleted/moved? 
[17:58:47] ottomata: doesn't seem related to IOs [17:59:29] 10Analytics, 10Beta-Cluster-Infrastructure, 10ChangeProp, 10EventBus, and 3 others: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162216 (10mobrovac) [18:00:10] I should have added the RPC metrics for journalnodes [18:00:21] they are not in the dashboard yet, lemme add them so we can see [18:00:51] joal: very interesting correlation with your observation [18:00:52] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=13&fullscreen&var-server=analytics1035&var-datasource=eqiad%20prometheus%2Fops [18:01:18] Nice [18:01:47] From hadoop, only a single job finished around that time - I had nothing special I think [18:03:44] weird [18:04:07] elukey: shall I just restart 1001 and transition it to active? [18:04:14] and see if it happens again? [18:04:23] tg for standby nn and auto failover! :) [18:05:35] I can access UI, elukey must have done it already [18:05:49] ottomata: --^ [18:06:05] nope I haven't touched anything [18:06:17] joal: not yarn [18:06:18] namenode [18:06:21] ottomata: shall we set the page to analytics only? [18:06:25] yes [18:06:32] Ah ! [18:07:31] elukey: i'm starting up the name node [18:07:43] ok [18:09:19] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [18:10:54] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162236 (10Ottomata) [18:12:23] ok elukey it's started and safemode has been turned off [18:12:28] i'm going to promote it back to active [18:13:02] oh, we have to just bounce nn on 1002 right? 
[18:13:23] oh failover riiight [18:14:26] (done) [18:14:34] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 20579 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. [18:14:38] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms [18:14:41] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.36.128:8485 [18:14:45] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.53.14:8485 [18:14:49] 2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.36.128:8485, 10.64.53.14:8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961)) [18:14:53] so these are the only things on an1001 that are relevant [18:14:55] the GC pause is horrible [18:14:56] 20s [18:15:04] :( [18:16:09] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162250 (10Ottomata) [18:16:50] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10Ottomata) [18:17:26] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10Ottomata) [18:18:06] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - 
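The FATAL above is the quorum journal manager giving up after its write timeout (the log shows `Waited 20579 ms (timeout=20000 ms)`), which the ~20.5 s JVM GC pause on the namenode blew straight through. If such pauses can't be eliminated by GC tuning, the timeout itself is configurable; this is a sketch only, with an illustrative value, not a vetted recommendation for this cluster:

```xml
<!-- hdfs-site.xml sketch: raise the QJM edit-write timeout so a long GC
     pause (here ~20.5 s against the 20000 ms default) does not make the
     NameNode abort with "flush failed for required journal".
     40000 is an illustrative value, not a tested recommendation. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>40000</value>
</property>
```

Raising the timeout only masks the symptom; the GC pause itself is the thing to chase, as the discussion below does.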
https://phabricator.wikimedia.org/T192832#4162258 (10Ottomata) [18:18:25] also another interesting thing [18:18:26] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=41&fullscreen&from=now-3h&to=now [18:18:44] from ~16:10 onwards there were a ton of under-replicated blocks [18:18:48] hm [18:20:17] but I don't see anything weird on the journal node metrics [18:21:22] but we know for sure that there was a spike in load and time-waits on the journal nodes [18:23:04] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka... [18:32:57] so from the journalnode metrics I don't see anything that clearly points to them for this [19:04:09] ok everything seems stable, I'll review metrics tomorrow morning with joal :) [19:04:32] I hope it is not related to the prometheus agent, it is the only thing that we changed recently [19:04:57] but no signs that this is the case [19:05:51] joal: last one https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&from=now-3h&to=now-1m&var-server=analytics1002&var-datasource=eqiad%20prometheus%2Fops [19:06:02] so analytics1002 was showing some signs of overload too [19:06:10] elukey: qq, you are moving prom jmx exporter yaml files into /etc/prometheus? [19:06:12] correct? [19:06:22] just had the race condition for /etc/kafka when reinstalling [19:06:28] i could enforce kafka package first [19:06:30] but not sure if that is the right thing [19:07:06] ottomata: I already did it for all the hadoop workers, and I am planning to do the same with druid.. it makes life easier [19:07:43] ok, just /etc/prometheus, right? 
[19:08:57] yep [19:10:05] all right, going offline again, talk with you guys (hopefully) tomorrow :) [19:10:08] byyeee [19:10:08] o/ [19:15:30] Thanks a lot Luca :) [19:15:50] milimetric: if you're anywhere nearby I think I have the explanation for the webrequests stats [19:22:09] milimetric: I'm gonna stop working today, let's discuss tomorrow :) [19:22:44] Oh nice joal, I’m at the doctor still, but if you can tell me over irc I’m all ears [19:23:10] oh, sorry, nite! See you next week, off tomorrow [19:23:16] milimetric: problem was with '-' dt values [19:23:47] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162467 (10Ottomata) [19:25:06] In WSS, we remove them from our counts using sequences - therefore the min/max diff [19:26:24] Solution is to remove them from the current_hour part of the FPC query, for those sequence-numbers not to mess up the counts [19:26:29] I'll send a patch tomorrow :) [19:26:33] milimetric: --^ [19:26:44] I'm gone after that, more details next week :) [19:30:27] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162470 (10Ottomata) 
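joal's explanation above can be illustrated with a toy version of the count check: rows whose `dt` field is `'-'` still consume sequence numbers, so an expected count derived from `max(seq) - min(seq) + 1` overstates the real row count when those rows are excluded from the tally but not from the range. This is a hypothetical Python sketch of the mechanism only (the real logic lives in the webrequest/FPC HQL, and in this toy the `'-'` row conveniently sits at the edge of the sequence range):

```python
# Toy illustration of the webrequest false-positive dataloss mechanism:
# dt == '-' rows hold sequence numbers but are excluded from counts.

def loss_check(rows, exclude_dash_from_range):
    """rows: list of (sequence_number, dt) tuples.
    The count always excludes dt == '-' rows; the expected count,
    max(seq) - min(seq) + 1, excludes them only when asked to."""
    counted = [seq for seq, dt in rows if dt != "-"]
    range_seqs = counted if exclude_dash_from_range else [seq for seq, _ in rows]
    expected = max(range_seqs) - min(range_seqs) + 1
    return expected, len(counted)

rows = [(1, "2018-04-26T19:00:00"), (2, "2018-04-26T19:00:01"),
        (3, "2018-04-26T19:00:02"), (4, "-")]

# Range over all rows, count over non-'-' rows: looks like data loss.
print(loss_check(rows, exclude_dash_from_range=False))  # (4, 3)
# Excluding '-' rows on both sides makes the two counts agree here.
print(loss_check(rows, exclude_dash_from_range=True))   # (3, 3)
```

This matches the shape of the fix joal sketches: filter the `'-'` rows out of the current_hour side of the query so their sequence numbers stop inflating the expected count.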
[19:50:44] 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4162511 (10Ottomata)