[00:49:47] PROBLEM - Check the last execution of monitor_refine_eventlogging_analytics_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:24] Analytics: MEP: canary events so we know events are flowing through pipeline - https://phabricator.wikimedia.org/T250844 (Nuria)
[05:59:20] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:59:50] this will alarm again in a bit --^
[06:00:22] now the timer works properly, and we have failed flags for eventError
[06:02:46] RECOVERY - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed_failure_flags on an-launcher1001 is OK: OK: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:14:17] Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (fdans) A note on DIY compression vs adding all values: Erik's format very clearly saves space. I wasn't taking into account the long tail of articles with a couple of pageviews per day that with...
[06:22:30] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_immediate_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_immediate_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:23:14] yep!
[06:23:40] ah no ok my comment above about evenError is not pertinent with sanitize of course
[06:24:16] the issue is FATAL RefineFailuresChecker: Failed loading configuration: input_path_regex is required but was not provided. Aborting.
[06:25:51] yeah input_path_regex is not in the config
[06:26:08] PROBLEM - Check the last execution of monitor_refine_sanitize_eventlogging_analytics_delayed_failure_flags on an-launcher1001 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_sanitize_eventlogging_analytics_delayed_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:38:10] joal: hello! when you have a couple minutes I'd love to get your opinion on an approach I'm trying to get a separate file for each pageview agent_type
[06:40:04] hola fdans
[06:40:16] elukey: o/
[07:04:26] all right all alerts in icinga are cleared, I acked one for el failure flag checks
[07:05:40] going afk, but ping me on the phone if anything explodes :)
[07:29:35] * joal is happy, mediawiki-history succeeded! our analysis was correct elukey :)
[07:29:47] * joal will check for geoeditors job later
[07:40:52] Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (fdans)
[10:22:22] joal: https://github.com/internetarchive/snakebite-py3/issues/8#issuecomment-625748266
[10:22:49] I cannot count the amount of hours that I tried to understand why it wasn't working
[10:23:08] so maybe I'll be able to make snakebite to work with RPC encryption
[10:23:18] we'll see :)
[11:07:59] (PS1) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[11:16:30] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (Cmjohnson) This server is out of warranty, @wiki_willy we can order a new 4TB disk.
[11:24:04] (PS2) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[11:28:42] (PS3) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[11:36:47] (PS4) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[11:39:00] (PS5) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[11:39:34] Analytics, Analytics-Kanban, Research, Patch-For-Review: Proposed adjustment to wmf.wikidata_item_page_link to better handle page moves - https://phabricator.wikimedia.org/T249773 (Milimetric) Oh that makes sense, the query in use right now has the same problem because it's not filtering out page...
[12:16:12] (PS6) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[12:21:29] (PS7) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[13:17:40] o/
[13:19:54] Analytics, MediaWiki-extensions-CentralNotice, Performance-Team: CentralNotice banners shouldn't be served to bots - https://phabricator.wikimedia.org/T252200 (Gilles)
[13:22:32] Analytics, Analytics-Kanban: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (mforns) @Nuria I understand now what you mean. Then, I'll also add some code to the RSVD module that discards time series that are too sparse (too many zero/default valu...
[13:27:51] (PS8) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[13:29:21] (PS9) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[13:43:11] hey ottomata :] is the missing hour for EventError something we need to fix?
[13:43:17] if so, I'll look into it
[13:44:15] (PS10) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[13:56:49] (PS11) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[13:59:05] mforns: we could but i didn't think it was worth it
[13:59:11] no on looks at that data i think
[13:59:13] no one
[13:59:30] i dunno even if we need to sanitize that one?
[13:59:47] i guess it is nice for historical stuff to see what was invalid, but if we sanitize it we will ahve to sanitize the raw_events fiedl
[14:04:22] (PS12) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[14:04:27] aha
[14:05:35] ottomata: the `event`.`mediawiki_api_request` issue was an OOM in the executor, do we need to bum that up in puppet?
[14:05:58] curious that it ran now, though
[14:08:34] ottomata: hola, can we merge these two turnilo changes (just tested) and bounce turnilo? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594472/
[14:08:41] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata)
[14:08:48] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/594272/
[14:09:18] sure merging
[14:19:56] mforns: possibly ya
[14:21:26] Analytics: Geoeditors job is faliing due to problems with geo udf - https://phabricator.wikimedia.org/T252205 (Nuria)
[14:21:44] Analytics, Analytics-Kanban: Geoeditors job is faliing due to problems with geo udf - https://phabricator.wikimedia.org/T252205 (Nuria)
[14:23:45] nuria: bunping up the refinery_jar_version
[14:23:50] *bumping
[14:23:54] mforns: i just did
[14:23:56] wait
[14:24:00] mforns: about to push
[14:25:12] ohhh!!!
[14:25:15] xD
[14:25:15] (PS1) Mforns: Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595182 (https://phabricator.wikimedia.org/T251542)
[14:25:17] just pushed too
[14:25:29] sorry...
[14:26:07] (PS1) Nuria: Bumping up jar for geoeditors job [analytics/refinery] - https://gerrit.wikimedia.org/r/595183 (https://phabricator.wikimedia.org/T252205)
[14:26:36] mforns: nvm my push took forever for some reason
[14:26:43] (Abandoned) Nuria: Bumping up jar for geoeditors job [analytics/refinery] - https://gerrit.wikimedia.org/r/595183 (https://phabricator.wikimedia.org/T252205) (owner: Nuria)
[14:27:01] my baaaaad
[14:27:09] (CR) Nuria: [C: +2] Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595182 (https://phabricator.wikimedia.org/T251542) (owner: Mforns)
[14:27:12] (CR) Nuria: [V: +2 C: +2] Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595182 (https://phabricator.wikimedia.org/T251542) (owner: Mforns)
[14:27:21] mforns: merged now
[14:27:26] nuria: deploy?
[14:27:39] or can it wait to next week
[14:27:55] mforns: it can wait , just we need to remember to restart
[14:28:13] will put it in the etherpad
[14:28:26] (PS13) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[14:28:39] ok, you beat me
[14:29:17] mforns: good,
[14:29:30] mforns: jaja
[14:29:36] mforns: thanks for the fast response
[14:29:47] nuria: hmm I messed up! I put the wrong task ID in the bump up gerrit patch
[14:30:32] nuria: wait, it has not been merged yet
[14:30:42] I will abandon too, and create another one
[14:30:44] mforns: I'm sorry for the SLA alarms!!
[14:31:18] fdans: xD no problem, if I had to pay 1 dollar for each one of those I let slip...
[14:32:02] (Abandoned) Mforns: Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595182 (https://phabricator.wikimedia.org/T251542) (owner: Mforns)
[14:32:41] mforns: ok, will be in meeting for abit, we can merge after
[14:32:53] np, I can self-merge
[14:36:19] (PS1) Mforns: Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595185 (https://phabricator.wikimedia.org/T252205)
[14:37:49] (CR) Mforns: [V: +2 C: +2] Bump up refinery_jar_version of geoeditors monthly after UDF fix [analytics/refinery] - https://gerrit.wikimedia.org/r/595185 (https://phabricator.wikimedia.org/T252205) (owner: Mforns)
[14:41:37] (PS14) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[14:45:24] (PS1) Ottomata: bin/camus - Fix undefined extra_java_opts when executing checker [analytics/refinery] - https://gerrit.wikimedia.org/r/595187
[14:45:37] (PS15) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[14:46:50] (PS2) Ottomata: bin/camus - Fix undefined extra_java_opts when executing checker [analytics/refinery] - https://gerrit.wikimedia.org/r/595187
[14:50:11] (CR) Ottomata: [V: +2 C: +2] bin/camus - Fix undefined extra_java_opts when executing checker [analytics/refinery] - https://gerrit.wikimedia.org/r/595187 (owner: Ottomata)
[14:53:45] (PS1) Mforns: Make anomaly detection correctly handle holes in time-series [analytics/refinery/source] - https://gerrit.wikimedia.org/r/595189 (https://phabricator.wikimedia.org/T251542)
[14:55:19] (PS16) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[14:55:31] (CR) Mforns: [C: -2] "Still needs testing." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/595189 (https://phabricator.wikimedia.org/T251542) (owner: Mforns)
[14:59:30] joal: are you off today?
[15:03:24] ping milimetric joal
[15:14:28] (PS17) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[15:16:44] !log restarted turnilo after applying nuria and mforns changes
[15:16:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:27:06] !log stopping kafka broker on kafka-jumbo1006 to investigate camus import failures - T252203
[15:27:08] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:27:08] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203
[15:28:14] Analytics: Study whether we need to increase pageview-hourly SLA after adding automated tag - https://phabricator.wikimedia.org/T252211 (mforns)
[15:29:21] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add new dimensions to druid's pageview_hourly datasource - https://phabricator.wikimedia.org/T243090 (Nuria) Open→Resolved
[15:29:24] Analytics, Analytics-Kanban, Patch-For-Review, Product-Analytics (Kanban): Add dimensions for Project type & language to Edits_hourly - https://phabricator.wikimedia.org/T232659 (Nuria)
[15:29:34] Analytics, Analytics-Kanban, Product-Analytics, Patch-For-Review: Add new dimensions to druid's pageview_hourly datasource - https://phabricator.wikimedia.org/T243090 (Nuria) turnilo is displaying the new dimensions
[15:34:13] (PS18) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[15:35:19] Hi ping people - ottomata, fdans, elukey - I'm indeed off today - Wanted to double check now the geoeditors failure but Marcel and Nuria found the error already :)
[15:35:54] elukey: mammoth debugging on snakebite-sasl
[15:36:43] !log starting kafka broker on kafka-jumbo1006, same issue on other brokers when they are leaders of offending partitions - T252203
[15:36:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:36:46] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203
[15:36:59] ottomata: anything urgent?
[15:37:32] joal: camus is not importing 3 partitions, 2 of api-request, one of cirrussearch. no idea why yet
[15:37:34] camus logs just say
[15:37:37] 2020-05-08 15:30:48,267 INFO [main] com.linkedin.camus.etl.kafka.common.KafkaReader: Connected to leader tcp://kafka-jumbo1003.eqiad.wmnet:9092 beginning reading at offset 11386686628 latest offset=11758854198
[15:37:37] 2020-05-08 15:30:48,614 INFO [main] com.linkedin.camus.etl.kafka.mapred.EtlRecordReader: Records read : 0
[15:37:43] but i think we'll figurue it out
[15:37:48] dan is dumping the data for those partitions
[15:37:51] just in case
[15:37:56] hm
[15:38:29] This is weird
[15:38:54] looks simliar to what happened in sept.
[15:38:54] https://phabricator.wikimedia.org/T233718#5523778
[15:40:04] yeah exact same thing
[15:40:26] I was also coming to hte same conclusion: exact same problem
[15:40:37] ottomata: can it be related to data-volume?
[15:40:46] ottomata: like too big of partitions?
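Editor's note on the Camus log line quoted at 15:37:37: the reader starts at one offset, reports a much later "latest offset", and still reads 0 records. The size of that gap can be worked out directly from the quoted line. The snippet below is a minimal sketch, assuming only the log format shown in that single message; the regular expression and the function name `offset_backlog` are hypothetical and are not part of Camus.

```python
import re

# Log line copied from the 15:37:37 message in this log.
LOG_LINE = (
    "2020-05-08 15:30:48,267 INFO [main] "
    "com.linkedin.camus.etl.kafka.common.KafkaReader: Connected to leader "
    "tcp://kafka-jumbo1003.eqiad.wmnet:9092 beginning reading at offset "
    "11386686628 latest offset=11758854198"
)

# Hypothetical pattern, written only against the line above.
OFFSETS_RE = re.compile(
    r"beginning reading at offset (?P<start>\d+) latest offset=(?P<latest>\d+)"
)


def offset_backlog(line):
    """Return (start, latest, backlog), or None if the line does not match."""
    match = OFFSETS_RE.search(line)
    if match is None:
        return None
    start = int(match.group("start"))
    latest = int(match.group("latest"))
    return start, latest, latest - start


if __name__ == "__main__":
    # Prints the start offset, the latest offset and the number of
    # messages between them for the quoted partition.
    print(offset_backlog(LOG_LINE))
```

For the quoted line the backlog comes out at roughly 372 million messages on that one partition, which is the data the kafkacat dump mentioned at 15:37:48 is meant to protect.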
[15:48:02] (PS19) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[15:58:17] joal: seems unlikey
[15:58:22] webrequest is way huger
[15:58:31] ottomata: per partition?
[15:59:00] joal: do you know, does camus use the offsets-m-* files to find offsets to start from, or does it use the offets.previous file
[15:59:01] ?
[15:59:04] offests-previous
[15:59:08] ottomata: I remember we were having a small number of partitions for some big topics
[15:59:15] there are 12 partitions
[15:59:18] for both of these topics
[16:01:06] hm - doesn't seem too small
[16:01:23] i'm guessing it uses offests.previous
[16:01:25] ottomata: I *think* camus uses offsets-m-* files
[16:01:28] hmmm
[16:01:37] but maybe not :S
[16:01:39] well, for these bad partitions
[16:01:47] there is no offsets-m file for them
[16:01:56] i think maybe camus is only writing those for partitions it reads data from?
[16:02:09] there is an offsets entry in offsets.previous for them though
[16:02:16] so it must be getting the offset to start from from there
[16:02:42] makes sense
[16:02:53] ottomata: can we try to force camus to start with the previous history?
[16:03:12] just the one before the one failing - instead of restarting for 0
[16:03:55] i was trying to figure out how to just make camus reset to 0 or whatever i don't care for thohse partitions
[16:04:01] but i can't just delete an individual file i guess
[16:04:13] ottomata: hm - maybe?
[16:04:22] i'd have to manually write out the EtlKey offsets-previous file with different values
[16:04:24] i think
[16:08:12] Analytics, Analytics-Kanban: Bump up SLA of pageview jobs after deploying bots check - https://phabricator.wikimedia.org/T252220 (Nuria)
[16:08:32] mforns: created ticket for SLA bump
[16:10:46] I can't seem to get on hive today. I get this error "Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)"
[16:10:50] am I doing something wrong?
[16:11:38] Jdlrobson: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide
[16:12:51] got it. thanks joal . looks like i need to setup some credentials
[16:13:15] indeed Jdlrobson :)
[16:13:28] thanks for saving me some time :)
[16:13:53] np Jdlrobson - pinging here is usually the best solution (or email :)
[16:14:11] ottomata: do you need me or may I leave ?
[16:14:24] joal: you can leave i'm just reading some camus source
[16:14:31] :S
[16:14:32] thanks! i'm sure i'll figure it out have a good day!
[16:14:46] best of luck ottomata - I'll be back online later tonight
[16:15:35] (PS20) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[16:16:11] Analytics: Request a Kerberos identity for jrobson - https://phabricator.wikimedia.org/T252222 (Jdlrobson)
[16:17:08] i think there is somethign to theh fact that there is noo offsets-m file for these partitions
[16:17:08] hmm
[16:22:39] Analytics, Analytics-Kanban, Patch-For-Review: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (Nuria) >Then, if we use the 90% threshold, the time series would be potentially discarded for several weeks/months after the recovery, right? Is thi...
[16:25:47] Analytics: SQL query failed on superset SQL lab - https://phabricator.wikimedia.org/T252225 (jwang)
[16:27:27] maybe this will fix itself, but i'm getting really high latency (takes a long time for commands i type to show up etc.) on stat1004 that isn't happening for me on stat1007 right now
[16:28:37] milimetric: is running a data capture there just in case, we are in risk of losing some data due to a camus bug atm: T252203
[16:28:38] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203
[16:28:43] buuut
[16:28:46] could probbably done better
[16:29:01] milimetric: just checkking you did specify the partition in your kafkacat commands, ya?
[16:29:06] ahh ok, i'll just stay away from it then :) (i just need cluster so i ca do that via stat1007)
[16:29:09] k
[16:29:22] ottomata: kafkacat commands are:
[16:29:24] https://www.irccloud.com/pastebin/nuqabPab/
[16:29:30] with partition and offsets as in the task
[16:29:30] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Performance-Team: CentralNotice banners shouldn't be served to bots - https://phabricator.wikimedia.org/T252200 (AndyRussG)
[16:29:51] ottomata: it's running on stat1004, saving in my home
[16:30:07] ok
[16:31:25] took about 230 GB so far, for only one day, might want to stop it soon, but I checked df and that was about 4% of disk on the mount my home is on, it's still low usage at 40%
[16:34:01] aye
[16:34:05] i think stat1007 has more space
[16:34:17] and 1008
[16:34:20] and 1005
[16:34:33] yarrr
[16:34:37] not getting much farther here...
[17:08:12] Analytics, Analytics-Kanban, Patch-For-Review: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (mforns) > > Then, if we use the 90% threshold, the time series would be potentially discarded for several weeks/months after the recovery, right? Is...
[17:13:46] Analytics, Analytics-Kanban, Patch-For-Review: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (Nuria) >Should it be responsibility of the RSVD anomaly detection module to check whether the data is OK? Agreed that it should be the caller of the...
[17:22:00] (PS21) Fdans: [wip]Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[17:22:49] Analytics, Analytics-Kanban, Patch-For-Review: Make anomaly detection correctly handle holes in time-series - https://phabricator.wikimedia.org/T251542 (mforns) > Agreed that it should be the caller of the module the one doing that OK, but the caller of RSVDAnomalyDetection.scala is not aware of the...
[17:23:22] (PS22) Fdans: Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[17:25:45] (PS23) Fdans: Add pageview daily dump oozie job to replace Pagecounts-EZ [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777)
[17:28:25] (CR) Fdans: "This is now tested and ready to review" [analytics/refinery] - https://gerrit.wikimedia.org/r/595152 (https://phabricator.wikimedia.org/T251777) (owner: Fdans)
[17:33:57] ottomata: it looks like it has enough space, I'm less than 1TB and almost done. I gotta run out, but I'll leave it running. This means we'll be covered for another 7 days of trying so worst case we can keep debugging next week, right?
[17:34:00] or is there more urgency?
[17:34:45] ok
[17:34:49] yes
[17:34:55] that sounds right milimetric thank you
[17:42:39] i'm trying to look into camus history to fnid the last run that successfully imported data for one of these partitions
[17:53:20] Analytics, CPT Initiatives (Revision Storage Schema Improvements), Epic, MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Jdforrester-WMF)
[17:53:45] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (wiki_willy) It looks like the 5yr server lifecycle will be ending next month. @elukey - would it be possible to decom this server instead? Thanks, Willy
[17:53:56] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (wiki_willy) a:wiki_willy
[18:07:51] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (elukey) @wiki_willy yes no problem, we are going to refresh this node soon!
[18:08:43] Analytics, DC-Ops, Operations, ops-eqiad, Patch-For-Review: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T252070 (wiki_willy) Thanks @elukey
[18:19:03] k, I'm putting the backups onto hdfs, they're current as of right now. they'll be in /user/milimetric/camus-stuck-on*
[18:19:57] great
[18:19:58] thank you
[18:22:44] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) This is a reoccurrence of {T233718}.
[18:34:14] Analytics, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (Cmjohnson) The servers have been moved to 10G racks, in order to keep 2 in row D, KJ1008/1009 are in the same rack, D7. Once we are able to get a 3rd switc...
[18:34:38] Analytics, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (Cmjohnson) Network switch has been updated, old entries removed and ports disabled.
[18:56:26] Analytics, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kafka-jumbo1007.eqiad.wmne...
[18:59:54] Analytics, Patch-For-Review: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (Isaac) Thanks @fdans for leading this work -- page view data and how to handle all the different potential sources is definitely one of the most-FAQ of frequently asked que...
[19:09:43] Analytics, Operations, ops-eqiad, Patch-For-Review: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (Cmjohnson) {F31808337} @elukey, it doesn't appear to be a partman thing. Attached is a picture of the console monitor during the ini...
[19:15:43] Analytics, CPT Initiatives (Revision Storage Schema Improvements), Epic, MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Ladsgroup) I assume next step here is for @Marostegui to apply...
[19:20:25] Analytics, Event-Platform, Inuka-Team (Kanban), KaiOS-Wikipedia-app (MVP), Patch-For-Review: Capture and send back client-side errors - https://phabricator.wikimedia.org/T248615 (SBisson) @jlinehan @Ottomata we would need a place to put the app version. I tried adding `meta.appVersion` and it...
[19:23:36] Analytics, Event-Platform, Inuka-Team (Kanban), KaiOS-Wikipedia-app (MVP), Patch-For-Review: Capture and send back client-side errors - https://phabricator.wikimedia.org/T248615 (Ottomata) Can you put it in the tags map? https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki...
[19:33:49] Analytics, Event-Platform, Inuka-Team (Kanban), KaiOS-Wikipedia-app (MVP), Patch-For-Review: Capture and send back client-side errors - https://phabricator.wikimedia.org/T248615 (SBisson) @Ottomata thanks, it works well.
[19:53:34] Analytics, CPT Initiatives (Revision Storage Schema Improvements), Epic, MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (DannyS712) >>! In T215466#6119796, @Ladsgroup wrote: > I assume...
[19:56:37] Analytics, Operations, ops-eqiad: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kafka-jumbo1007.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kafka-jumbo1007.eqiad.wmnet'] `
[20:12:18] Analytics, CPT Initiatives (Revision Storage Schema Improvements), Epic, MW-1.35-notes (1.35.0-wmf.32; 2020-05-12), Technical-Debt: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 (Jdforrester-WMF)
[20:39:01] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) No idea why this happens. I did notice that after each run, since no data is written for these partitions, camus saving their offset...
[20:45:43] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) To replay: ` kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eqiad.mediawiki.api-request -p 2 -o 11388074561 -c 10001 > eqiad.med...
[21:06:29] !log running prefered replica election for kafka-jumbo to get preferred leaders back after reboot of broker earlier today - T252203
[21:06:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[21:06:32] T252203: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203
[21:07:03] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) >:( I spoke too soon. From a recent run 2020-05-08-20-45-15 application_1583418280867_339736: ` topic:eqiad.mediawiki.api-requ...
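Editor's note on the replay command in the 20:45:43 message: the visible part of that kafkacat invocation reads a fixed number of messages (`-c`) from one partition (`-p`) of one topic (`-t`), starting at an explicit offset (`-o`). A minimal sketch of building the same argument list for other partitions follows. The helper name `kafkacat_replay_command` is hypothetical, and the output redirect is left out because the file name is truncated in the log.

```python
import shlex


def kafkacat_replay_command(broker, topic, partition, start_offset, count):
    """Build a kafkacat consumer command that reads `count` messages from
    one partition of `topic`, starting at `start_offset`."""
    return [
        "kafkacat", "-C",           # consumer mode
        "-b", broker,               # bootstrap broker
        "-t", topic,                # topic to read
        "-p", str(partition),       # single partition
        "-o", str(start_offset),    # offset to start from
        "-c", str(count),           # stop after this many messages
    ]


# Values taken from the visible part of the 20:45:43 message.
cmd = kafkacat_replay_command(
    "kafka-jumbo1001.eqiad.wmnet:9092",
    "eqiad.mediawiki.api-request",
    2,
    11388074561,
    10001,
)

if __name__ == "__main__":
    # Prints the command as a single shell-safe string.
    print(shlex.join(cmd))
```

Keeping the command as a list means it can be passed to `subprocess.run` without shell quoting concerns, with stdout directed to whatever file the operator chooses.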
[21:56:50] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) FINE. Let's try to skip at least an hour of data. `lang=scala // api-requset is about 9000 / second and has 12 partitions. Let's...
[22:18:52] Analytics: Camus failing to import eqiad.mediawiki.(api|cirrussearch)-request from partitions leaders on kafka-jumbo1006 - https://phabricator.wikimedia.org/T252203 (Ottomata) Next run was still good: ` yarn-logs -u analytics application_1583418280867_339909 | grep -A 25 -E 'topic:eqiad.mediawiki.api-reques...
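Editor's note on the "skip at least an hour of data" comment at 21:56:50: the truncated snippet gives a rate of about 9000 messages per second over 12 partitions. The arithmetic for how many offsets one hour represents on a single partition can be sketched as below. This assumes traffic is spread evenly across partitions, which the log does not confirm, and the function name `offsets_to_skip` is hypothetical; it is not the code from the Phabricator comment, which is cut off in the log.

```python
def offsets_to_skip(msgs_per_second, partitions, seconds):
    """Approximate number of offsets covering `seconds` of traffic on one
    partition, assuming messages are spread evenly across partitions."""
    return msgs_per_second * seconds // partitions


if __name__ == "__main__":
    # 9000 msgs/s over 12 partitions, for one hour (3600 s).
    print(offsets_to_skip(9000, 12, 3600))
```

With the figures from the message this comes to 2,700,000 offsets per partition per hour, so advancing a stuck partition's stored offset by at least that much would skip roughly an hour of data under the even-distribution assumption.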