[01:21:07] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4159961 (10Milimetric) [01:21:57] 10Analytics-Kanban: Vet new geo wiki data - https://phabricator.wikimedia.org/T191343#4102098 (10Milimetric) Ok, @Nuria: I've incorporated some hard-fought analysis about the percent differences here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Geowiki#Percent_Difference [01:22:48] note to self: jump out of a second story window next time someone asks me to do any kind of statistics. The broken bones are sure to be more pleasant [01:23:23] note to joal: I admire and envy your pretty graphs and what must be amazing spreadsheet skills. [10:54:02] 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4160779 (10mobrovac) [11:19:01] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160899 (10fdans) OK, so to determine the periodicity of the cron job, I ran a city query over ~17,000 IP addresses with: - The most current GeoIP d... [11:20:51] hellloooo milimetric: can you take a look at this? ^ [11:22:04] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4160932 (10faidon) As far as periodicity goes, note that MaxMind [[ https://support.maxmind.com/geoip-faq/geoip2-and-geoip-legacy-databases/how-often-a... [12:42:47] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#4161078 (10Milimetric) thanks @faidon, we were just seeing if maybe the accuracy of the old databases is really high, we can schedule the jobs less oft... 
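The periodicity vetting fdans describes above (geolocating the same IP sample against the current and an older MaxMind database, then measuring how much the results drift) can be sketched in Python. This is a hypothetical illustration, not the actual query that was run; `lookups_current` and `lookups_old` stand in for ip → city results from the two database versions.

```python
# Hypothetical sketch: measure drift between two GeoIP database versions
# by comparing city lookups for the same IP sample.

def geo_drift(lookups_current, lookups_old):
    """Return the fraction of IPs whose city result changed between
    the two database versions (both dicts map ip -> city name)."""
    common = set(lookups_current) & set(lookups_old)
    if not common:
        return 0.0
    changed = sum(1 for ip in common
                  if lookups_current[ip] != lookups_old[ip])
    return changed / len(common)

# Made-up data: one of four IPs moved city between database versions.
current = {"ip1": "Madrid", "ip2": "Paris", "ip3": "Rome", "ip4": "Oslo"}
old     = {"ip1": "Madrid", "ip2": "Paris", "ip3": "Rome", "ip4": "Bergen"}
print(geo_drift(current, old))  # 0.25
```

A drift curve over databases of increasing age is what motivates the weekly-snapshot decision discussed below.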
[12:42:51] fdans: keeping it weekly is the only choice there, commented [12:43:22] fdans: good thing you ran that! It's so bad after even a month!! [13:01:06] 10Analytics-Kanban, 10Patch-For-Review, 10Puppet: Puppetize job that saves old versions of Maxmind geoIP database - https://phabricator.wikimedia.org/T136732#2345955 (10JAllemandou) +1 for weekly on wednesday. Thanks @fdans and @faidon :) [13:09:04] 10Analytics, 10Discovery-Analysis, 10Product-Analytics: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#4161142 (10Ottomata) Anyone can kill YARN jobs that they own: https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#application `yarn kill -... [13:40:44] 10Analytics, 10Discovery-Analysis, 10Product-Analytics: Get 'sparklyr' working on stats1005 - https://phabricator.wikimedia.org/T139487#4161203 (10GoranSMilovanovic) @Ottomata Thank you - I didn't know about this one. I will need to check further why this happens with {sparklyr}. [13:45:12] hmmm... I accidentally pushed directly to refinery-source master branch [13:45:41] instead of HEAD:refs/for/master [13:45:59] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161212 (10Ottomata) [13:52:17] (03PS1) 10Mforns: Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 [13:59:02] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161238 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1057.eqiad.wmnet'] ``` T... [13:59:05] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. 
- https://phabricator.wikimedia.org/T192557#4161239 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1056.eqiad.wmnet'] ``` T... [14:07:41] (03CR) 10Ottomata: [C: 031] Correct default EL whitelist path in ELSanitization.scala (+CR) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/429194 (owner: 10Mforns) [14:16:08] ottomata: hm, I guess the script doesn't need to be a template anymore then :) [14:19:55] yup! [14:23:48] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161288 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1056.eqiad.wmnet'] ``` and were **ALL** successful. [14:25:07] 10Analytics-Kanban, 10Patch-For-Review: Add defaults section to WhitelistSanitization.scala - https://phabricator.wikimedia.org/T190202#4161297 (10Milimetric) p:05Triage>03Normal [14:25:23] 10Analytics-Kanban, 10Analytics-Wikistats, 10Patch-For-Review: Make mediawiki-history-reduced table permanent (snapshot partitioning) - https://phabricator.wikimedia.org/T192482#4161298 (10Milimetric) p:05Triage>03Normal [14:32:01] 10Analytics-Kanban, 10Patch-For-Review: Checklist for geowiki pipeline - https://phabricator.wikimedia.org/T190409#4161307 (10Milimetric) [14:35:42] 10Analytics-Kanban: Clean up wmf_raw.mediawiki_private_cu_changes and wmf.geowiki_daily - https://phabricator.wikimedia.org/T193165#4161310 (10Milimetric) p:05Triage>03Normal [14:36:22] fdans: wait what did we decide to do with wikistats deployment? [14:37:06] milimetric: keep doing the same thing but avoid the glitch [14:37:39] how do we avoid the glitch? 
[14:47:44] 10Analytics: Add templating support to reportupdater scripts - https://phabricator.wikimedia.org/T163252#3191334 (10mforns) This is already supported by RU, see: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Reportupdater#The_reports_section The template parameters are passed to the script in alphabetica... [14:50:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161360 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1057.eqiad.wmnet'] ``` and were **ALL** successful. [14:50:51] fdans, mforns : i have fixed the tests for ua parser but more changes are needed, will continue working on this tomorrow [14:51:07] nuria_, saw that in the email [14:51:21] k [14:51:45] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:51:49] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161375 (10mforns) [14:52:19] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:52:20] 10Analytics: Investigate adding user-friendly testing functionality to Reportupdater - https://phabricator.wikimedia.org/T156523#4161376 (10mforns) [14:52:41] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:52:44] 10Analytics, 10Easy: Reportupdater: do not write execution control files in source directories - https://phabricator.wikimedia.org/T173604#4161378 (10mforns) [14:53:53] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161381 (10mforns) [14:53:55] 10Analytics: Reportupdater writes a README in the output folder - https://phabricator.wikimedia.org/T163134#4161380 (10mforns) [14:54:10] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161365 (10mforns) [14:54:12] 10Analytics: Add templating support to reportupdater scripts - 
https://phabricator.wikimedia.org/T163252#4161382 (10mforns) [14:54:58] 10Analytics: Add templating support to reportupdater scripts - https://phabricator.wikimedia.org/T163252#3191334 (10mforns) Added this task as a subtask of T193167 for the record, but I think this can be closed as resolved or invalid. [14:56:13] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161389 (10mforns) [14:56:21] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161399 (10mforns) [14:56:24] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161398 (10mforns) [15:02:42] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161424 (10mforns) [15:02:49] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161436 (10mforns) [15:02:51] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161435 (10mforns) [15:10:32] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161463 (10mforns) [15:10:53] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161475 (10mforns) [15:10:55] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161474 (10mforns) [15:13:18] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is CRITICAL: Return code of 255 is out of bounds [15:13:52] hm [15:16:19] ah it works! [15:16:24] stat1005 died [15:16:30] :) [15:17:01] what? 
I can ssh into it [15:17:05] oh before [15:17:23] yeah, and /mnt/hdfs is not usable [15:17:42] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161508 (10mforns) [15:17:47] (got the message and logged in to check :) [15:17:48] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161519 (10mforns) [15:17:50] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161518 (10mforns) [15:18:31] 10Analytics: [reportupdater] consider not requiring date as a first column of query/script results - https://phabricator.wikimedia.org/T193174#4161508 (10mforns) [15:19:03] i umounted and mounted it [15:19:06] (my internet keeps dying!) [15:19:06] super [15:19:14] the alert works though yay! [15:22:16] going afk again to finish the move [15:22:34] (ETOOMANYBOXESTODAY) [15:24:08] laterrrs [15:26:09] 10Analytics, 10Analytics-Kanban: Productionize EventLoggingSanitization - https://phabricator.wikimedia.org/T193176#4161574 (10mforns) [15:27:41] 10Analytics, 10Analytics-Kanban: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176#4161574 (10mforns) [15:43:18] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is OK: OK [15:44:37] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161630 (10Ottomata) [15:50:36] ottomata, I'm not sure where the cron job for EventLoggingSanitization.scala needs to go. There's this refine.pp... the ELSanitization could be a refine job, but it needs to parse and validate the whitelist, so it has an extra previous step, that makes it not suited for refine.pp no? The other option I saw is data_drop.pp. What do you think? 
[15:55:33] hm [15:55:42] mforns: i'd be fine with either, maybe we can rename data_drop.pp [15:56:24] refinery::job::purge [15:56:24] ? [15:57:16] or mforns we could add an $extra_ops param to refine_job [15:57:17] ? [15:57:24] hm, naw it is a different main class [15:57:28] naw let's just make it a one-off cron job [15:57:54] a-team i’ll be a couple min late to standup [15:59:17] ottomata, ok [16:02:50] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161741 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1054.eqiad.wmnet'] ``` T... [16:02:53] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4161742 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1055.eqiad.wmnet'] ``` T... 
[16:04:43] 10Analytics, 10EventBus, 10MediaWiki-JobQueue, 10Goal, and 2 others: FY17/18 Q4 Program 8 Services Goal: Complete the JobQueue transition to EventBus - https://phabricator.wikimedia.org/T190327#4161768 (10mobrovac) [16:27:13] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959#4161862 (10fdans) p:05Triage>03Normal [16:27:20] 10Analytics, 10User-Elukey: Only hdfs (or authenticated user) should be able to run Druid indexing jobs - https://phabricator.wikimedia.org/T192959#4155423 (10fdans) p:05Normal>03Low [16:29:23] 10Analytics: Under construction page in wikistats to take site down - https://phabricator.wikimedia.org/T192847#4161880 (10fdans) p:05Triage>03High [16:30:11] 10Analytics, 10Analytics-Kanban: Productionize EventLogging sanitization - https://phabricator.wikimedia.org/T193176#4161884 (10fdans) p:05Triage>03High [16:31:47] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161887 (10fdans) p:05Triage>03Unbreak! [16:31:55] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161891 (10fdans) p:05Unbreak!>03Triage [16:32:29] 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4161892 (10fdans) p:05Triage>03Normal [16:32:46] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161896 (10mforns) p:05Triage>03Unbreak! [16:32:48] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161900 (10mforns) p:05Triage>03Unbreak! 
[16:32:59] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161902 (10mforns) p:05Triage>03Unbreak! [16:33:01] 10Analytics: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169#4161906 (10mforns) p:05Triage>03Unbreak! [16:33:04] 10Analytics, 10Commons, 10EventBus, 10MediaWiki-JobQueue, and 3 others: Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4161910 (10fdans) p:05Normal>03Triage [16:33:09] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161913 (10mforns) p:05Triage>03Unbreak! [16:33:22] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161919 (10mforns) p:05Triage>03Unbreak! [16:33:24] 10Analytics: reportupdater TLC - https://phabricator.wikimedia.org/T193167#4161926 (10mforns) p:05Unbreak!>03Triage [16:33:26] 10Analytics, 10Collaboration-Team-Triage, 10EventBus, 10MediaWiki-JobQueue, and 2 others: Make EchoNotification job JSON-serialiable - https://phabricator.wikimedia.org/T192945#4161923 (10fdans) p:05Normal>03Triage [16:33:38] 10Analytics: [reportupdater] Add a configurable hive client - https://phabricator.wikimedia.org/T193169#4161927 (10mforns) p:05Unbreak!>03Triage [16:34:06] 10Analytics: [reportupdater] Allow defaults for all config parameters - https://phabricator.wikimedia.org/T193171#4161929 (10mforns) p:05Unbreak!>03Triage [16:34:15] 10Analytics: [reportupdater] eliminate the funnel parameter - https://phabricator.wikimedia.org/T193170#4161930 (10mforns) p:05Unbreak!>03Triage [16:34:25] 10Analytics: [reportupdater] add hourly granularity - https://phabricator.wikimedia.org/T193168#4161931 (10mforns) p:05Unbreak!>03Triage [16:34:59] 10Analytics: [reportupdater] consider not requiring date as a first colum of query/script results - https://phabricator.wikimedia.org/T193174#4161933 (10mforns) p:05Unbreak!>03Triage [16:47:01] joal: 
https://github.com/wikimedia/analytics-refinery/blob/master/hive/webrequest/check_dataloss_false_positives.hql [16:47:06] joal: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/load/generate_sequence_statistics.hql [17:32:13] 10Analytics, 10Cassandra, 10Maps-Sprint, 10Operations, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10Eevans) The restbase cluster has been upgraded package-wise, but a rolling restart still needs to be scheduled. [17:34:20] 10Analytics, 10Cassandra, 10Maps-Sprint, 10Operations, and 4 others: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948#4154846 (10MoritzMuehlenhoff) >>! In T192948#4162125, @Eevans wrote: > The restbase cluster has been upgraded package-wise, but a rolling... [17:38:26] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4162150 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1055.eqiad.wmnet'] ``` and were **ALL** successful. [17:38:41] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557#4162151 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1054.eqiad.wmnet'] ``` and were **ALL** successful. [17:40:08] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:41:00] whaaaa [17:41:09] whaaa [17:41:10] looking [17:42:14] elukey: i see no reason why it shut down [17:42:18] just that it did? [17:42:40] 1002 has taken over as active [17:43:01] starting it back up.. [17:43:14] java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond. [17:43:19] oh [17:43:20] zk? 
[17:43:23] journalnodes? [17:43:31] it might be either of the two [17:43:55] 2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.36.128:8485, 10.64.53.14 [17:43:59] :8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961)) [17:45:52] hm [17:46:29] i didn't reboot any journalnodes [17:47:32] so in theory when two out of three jn are not available the namenode shuts down [17:47:46] ? [17:47:49] oh two out of three [17:47:49] ya [17:48:00] they seem fine though [17:49:17] yeah I don't find anything [17:50:05] ottomata: as a follow-up, should we change the pager to analytics only? [17:54:24] I can see a ton of INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_1287766178_214099831 to [17:54:27] right before the issue [17:57:06] it is weird, they started at 11 UTC [17:57:19] wow elukey my computer just hard crashed [17:57:19] sorry [17:57:32] no problem :) [17:57:45] elukey, ottomata: spike in processes and load on an1028, an1035 and an1052 at 17:37 UTC - Could be related? [17:58:02] sounds related [17:58:03] all at once [17:58:04] ? [17:58:06] yup [17:58:10] on the 3 nodes [17:58:16] were there like a huge number of edits to make or something? [17:58:23] lots of files being created/deleted/moved? 
[17:58:47] ottomata: doesn't seem related to IOs [17:59:29] 10Analytics, 10Beta-Cluster-Infrastructure, 10ChangeProp, 10EventBus, and 3 others: Puppet broken on deployment-cpjobqueue - https://phabricator.wikimedia.org/T193127#4162216 (10mobrovac) [18:00:10] I should have added the RPC metrics for journalnodes [18:00:21] they are not in the dashboard yet, lemme add them so we can see [18:00:51] joal: very interesting correlation with your observation [18:00:52] https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&panelId=13&fullscreen&var-server=analytics1035&var-datasource=eqiad%20prometheus%2Fops [18:01:18] Nice [18:01:47] From hadoop, only a single job finished around that time - I had nothing special I think [18:03:44] weird [18:04:07] elukey: shall I just restart 1001 and transition it to active? [18:04:14] and see if it happens again? [18:04:23] tg for standby nn and auto failover! :) [18:05:35] I can access UI, elukey must have done it already [18:05:49] ottomata: --^ [18:06:05] nope I haven't touched anything [18:06:17] joal: not yarn [18:06:18] namenode [18:06:21] ottomata: shall we set the page to analytics only? [18:06:25] yes [18:06:32] Ah ! [18:07:31] elukey: i'm starting up the name node [18:07:43] ok [18:09:19] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [18:10:54] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162236 (10Ottomata) [18:12:23] ok elukey it's started and safemode has been turned off [18:12:28] i'm going to promote it back to active [18:13:02] oh, we have to just bounce nn on 1002 right? 
[18:13:23] oh failover riiight [18:14:26] (done) [18:14:34] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 20579 ms (timeout=20000 ms) for a response for sendEdits. No responses yet. [18:14:38] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 20557ms [18:14:41] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.36.128:8485 [18:14:45] 2018-04-26 17:37:58,103 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 20579ms to send a batch of 2 edits (423 bytes) to remote journal 10.64.53.14:8485 [18:14:49] 2018-04-26 17:37:58,103 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.64.36.128:8485, 10.64.53.14:8485, 10.64.5.15:8485], stream=QuorumOutputStream starting at txid 1886903961)) [18:14:53] so these are the only things on an1001 that are relevant [18:14:55] the GC pause is horrible [18:14:56] 20s [18:15:04] :( [18:16:09] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162250 (10Ottomata) [18:16:50] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10Ottomata) [18:17:26] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10Ottomata) [18:18:06] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - 
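The FATAL above is the quorum journal manager giving up after its write timeout (the log shows `Waited 20579 ms (timeout=20000 ms)`), which the ~20.5 s JVM GC pause on the namenode blew straight through. If such pauses can't be eliminated by GC tuning, the timeout itself is configurable; this is a sketch only, with an illustrative value, not a vetted recommendation for this cluster:

```xml
<!-- hdfs-site.xml sketch: raise the QJM edit-write timeout so a long GC
     pause (here ~20.5 s against the 20000 ms default) does not make the
     NameNode abort with "flush failed for required journal".
     40000 is an illustrative value, not a tested recommendation. -->
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>40000</value>
</property>
```

Raising the timeout only masks the symptom; the GC pause itself is the thing to chase, as the discussion below does.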
https://phabricator.wikimedia.org/T192832#4162258 (10Ottomata) [18:18:25] also another interesting thing [18:18:26] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?orgId=1&panelId=41&fullscreen&from=now-3h&to=now [18:18:44] from ~16:10 onwards there were a ton of under-replicated blocks [18:18:48] hm [18:20:17] but I don't see anything weird on the journal node metrics [18:21:22] but we know for sure that there was a spike in load and time-waits on the journal nodes [18:23:04] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4151019 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['ka... [18:32:57] so from the journalnode metrics I don't see anything that clearly points to them for this [19:04:09] ok everything seems stable, I'll review metrics tomorrow morning with joal :) [19:04:32] I hope it is not related to the prometheus agent, it is the only thing that we changed recently [19:04:57] but no signs that this is the case [19:05:51] joal: last one https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&from=now-3h&to=now-1m&var-server=analytics1002&var-datasource=eqiad%20prometheus%2Fops [19:06:02] so analytics1002 was showing some signs of overload too [19:06:10] elukey: qq, you are moving prom jmx exporter yaml files into /etc/prometheus? [19:06:12] correct? [19:06:22] just had the race condition for /etc/kafka when reinstalling [19:06:28] i could enforce kafka package first [19:06:30] but not sure if that is the right thing [19:07:06] ottomata: I already did it for all the hadoop workers, and I am planning to do the same with druid.. it makes life easier [19:07:43] ok, just /etc/prometheus, right? 
[19:08:57] yep [19:10:05] all right, going offline again, talk with you guys (hopefully) tomorrow :) [19:10:08] byyeee [19:10:08] o/ [19:15:30] Thanks a lot Luca :) [19:15:50] milimetric: if you're anywhere nearby I think I have the explanation for the webrequests stats [19:22:09] milimetric: I'm gonna stop working today, let's discuss tomorrow :) [19:22:44] Oh nice joal, I’m at the doctor still, but if you can tell me over irc I’m all ears [19:23:10] oh, sorry, nite! See you next week, off tomorrow [19:23:16] milimetric: problem was with '-' dt values [19:23:47] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162467 (10Ottomata) [19:25:06] In WSS, we remove them from our counts using sequences - therefore the min/max diff [19:26:24] Solution is to remove them from the current_hour part of the FPC query, for those sequence-numbers not to mess up the counts [19:26:29] I'll send a patch tomorrow :) [19:26:33] milimetric: --^ [19:26:44] I'm gone after that, more details next week :) [19:30:27] 10Analytics, 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Upgrade to Stretch and Java 8 for Kafka main cluster - https://phabricator.wikimedia.org/T192832#4162470 (10Ottomata) 
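joal's explanation above can be illustrated with a toy version of the count check: rows whose `dt` field is `'-'` still consume sequence numbers, so an expected count derived from `max(seq) - min(seq) + 1` overstates the real row count when those rows are excluded from the tally but not from the range. This is a hypothetical Python sketch of the mechanism only (the real logic lives in the webrequest/FPC HQL, and in this toy the `'-'` row conveniently sits at the edge of the sequence range):

```python
# Toy illustration of the webrequest false-positive dataloss mechanism:
# dt == '-' rows hold sequence numbers but are excluded from counts.

def loss_check(rows, exclude_dash_from_range):
    """rows: list of (sequence_number, dt) tuples.
    The count always excludes dt == '-' rows; the expected count,
    max(seq) - min(seq) + 1, excludes them only when asked to."""
    counted = [seq for seq, dt in rows if dt != "-"]
    range_seqs = counted if exclude_dash_from_range else [seq for seq, _ in rows]
    expected = max(range_seqs) - min(range_seqs) + 1
    return expected, len(counted)

rows = [(1, "2018-04-26T19:00:00"), (2, "2018-04-26T19:00:01"),
        (3, "2018-04-26T19:00:02"), (4, "-")]

# Range over all rows, count over non-'-' rows: looks like data loss.
print(loss_check(rows, exclude_dash_from_range=False))  # (4, 3)
# Excluding '-' rows on both sides makes the two counts agree here.
print(loss_check(rows, exclude_dash_from_range=True))   # (3, 3)
```

This matches the shape of the fix joal sketches: filter the `'-'` rows out of the current_hour side of the query so their sequence numbers stop inflating the expected count.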
[19:50:44] 10Analytics-Kanban, 10EventBus, 10Patch-For-Review, 10Services (watching): Enable snappy compression for eventbus Kafka producer - https://phabricator.wikimedia.org/T193080#4162511 (10Ottomata)