[01:53:21] PROBLEM - Check unit status of monitor_refine_event_failure_flags on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [05:41:53] 10Analytics, 10Dumps-Generation: Mention QRank in “Analytics Datasets” - https://phabricator.wikimedia.org/T278416 (10ArielGlenn) [05:49:32] 10Analytics, 10Dumps-Generation: Mention QRank in “Analytics Datasets” - https://phabricator.wikimedia.org/T278416 (10ArielGlenn) We do, but not on that page. If this is related to page views, maybe we should add it directly to https://dumps.wikimedia.org/other/pageviews/readme.html, which could probably use s... [07:04:42] good morning [07:06:54] Good morning [07:08:33] bonjour [07:10:25] RECOVERY - Check unit status of monitor_refine_event_failure_flags on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_failure_flags https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:15:20] 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10MoritzMuehlenhoff) Is an-launcher in any way different from the rest of Hadoop, like different mount options or so? We would have seen that error also happening on the rest of the Hadoo... [08:20:20] 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10elukey) Not that I know, we deploy the /mnt/hdfs fuse mount only on clients (like stat100x, etc.) and we use the same config everywhere. The main issue that I have observed in the pas... [08:26:58] 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10MoritzMuehlenhoff) Hmmh, I added some debug output to wmf-auto-restart on an-launcher1002 and it seems to correctly pick up the config, the executed lsof command is ` /usr/bin/lsof +... [08:29:02] elukey: old but interesting: https://a-ghorbani.github.io/2016/12/23/spark-on-yarn-and-java-8-and-virtual-memory-error [08:31:35] will read, thanks! [08:32:44] I am currently reading the example in https://hadoop.apache.org/docs/r2.10.0/hadoop-yarn/hadoop-yarn-site/NodeLabel.html [08:33:23] and I am not sure what is best - a separate GPU queue with FIFO etc., or allowing users.default to use the GPU labels? [08:33:34] maybe the latter could be good for a quick test [08:33:43] or for the initial use cases [08:33:54] elukey: shouldn't we use users.fifo?
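The lsof invocation quoted at 08:26 is cut off above, so the exact flags wmf-auto-restart passes are not visible here. The failure mode discussed in T278371 is lsof blocking on a wedged /mnt/hdfs FUSE mount; one possible mitigation, assuming the installed lsof is new enough to support it, is the -e option, which exempts a filesystem from the stat()-style kernel calls that can hang. A minimal sketch, not the deployed command:

```
# Sketch only: exempt the HDFS FUSE mountpoint from potentially blocking kernel
# calls so a wedged fuse process cannot stall the whole scan. TARGET_PID is a
# placeholder; the real wmf-auto-restart flags are not shown in the log above.
/usr/bin/lsof -n -e /mnt/hdfs -p "$TARGET_PID"
```

Whether exempting the mount (versus dealing with the stuck fuse process itself) is the right tuning is exactly what the task is weighing.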
[08:34:24] joal: we could yes, I am not sure who will use that queue though, it was one of the follow-ups that I wanted to do with you [08:34:34] I didn't follow up with Erik [08:35:02] elukey: IIRC ebernhardson needs it, but I also think it would be a good use case for GPUs [08:45:30] so [08:45:31] 'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels' => 'GPU', [08:45:34] 'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels.GPU.capacity' => '100', [08:45:37] 'yarn.scheduler.capacity.root.users.fifo.default-node-label-expression' => 'GPU', [08:46:12] elukey: I also found that: https://issues.apache.org/jira/browse/YARN-4714 [08:46:56] joal: yes I was reading it yesterday, it is linked in the logs when the container gets killed, but I didn't find clear guidance [08:47:28] elukey: My understanding is they suggest what we did (disable vmem-check) [08:48:38] joal: not sure, there are various ideas; I still think that having some barrier/check is good, but disabling is a good compromise in our use case [08:48:52] Agreed --^ [08:49:27] I like the idea of having a limit, but if the ratio needs to be so high as to be unrealistic, maybe it's not worth it :) [08:50:34] elukey: I'm also gonna investigate the specific cases I have understood as being problematic [08:50:58] very interesting, the MALLOC_ARENA_MAX [08:53:07] I knew you'd like it :) [08:53:14] 10Analytics, 10SRE: wmf-auto-restart.py + lsof + /mnt/hdfs may need to be tuned - https://phabricator.wikimedia.org/T278371 (10elukey) Yep, most of the time it works fine, but when the fuse process gets into its weird state then everything trying to access /mnt/hdfs stalls waiting for it (no idea how to trigge... [08:53:23] NEEEEEEEERD-snipe! [09:16:42] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10CommanderWaterford) https://quarry.wmflabs.org/query/53621 - no results displayed after running for almost an hour .... [09:26:09] created https://gerrit.wikimedia.org/r/c/operations/puppet/+/675088 for the labels :) [09:49:17] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10CommanderWaterford) Queries of the last 4 hours are not being executed.
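The three properties pasted at 08:45 only touch the users.fifo leaf queue. Going by the NodeLabel documentation linked at 08:32, a label also needs capacities assigned for its partition at every level of the queue hierarchy, which would be consistent with the fifo queue later showing no resources in the GPU partition. A sketch in the same style as the lines above; the root- and parent-level entries are an assumption based on those docs, not the content of the actual puppet change (https://gerrit.wikimedia.org/r/c/operations/puppet/+/675088):

```
# The label itself is registered on the ResourceManager, e.g. (placeholder hostname):
#   yarn rmadmin -addToClusterNodeLabels 'GPU'
#   yarn rmadmin -replaceLabelsOnNode 'SOME-GPU-WORKER=GPU'
# Queue-side capacity-scheduler settings (sketch):
'yarn.scheduler.capacity.root.accessible-node-labels.GPU.capacity'            => '100',
'yarn.scheduler.capacity.root.users.accessible-node-labels.GPU.capacity'      => '100',
'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels'              => 'GPU',
'yarn.scheduler.capacity.root.users.fifo.accessible-node-labels.GPU.capacity' => '100',
'yarn.scheduler.capacity.root.users.fifo.default-node-label-expression'       => 'GPU',
```

The vmem-check and MALLOC_ARENA_MAX items discussed above are separate knobs (yarn.nodemanager.vmem-check-enabled in yarn-site and a glibc environment variable, respectively), not capacity-scheduler settings.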
[10:26:44] (03PS1) 10Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) [10:27:38] (03CR) 10jerkins-bot: [V: 04-1] Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [10:31:32] (03PS2) 10Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) [10:32:00] (03CR) 10jerkins-bot: [V: 04-1] Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [10:46:55] (03PS3) 10Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) [10:47:23] (03CR) 10jerkins-bot: [V: 04-1] Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [10:51:58] (03PS4) 10Gilles: Add cacheHost field to NavigationTiming schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) [11:08:49] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Clean up issues with jobs after Hadoop Upgrade - https://phabricator.wikimedia.org/T274322 (10elukey) [11:09:05] ok I have applied a GPU label [11:09:14] but it is a little bit more convoluted than expected [11:09:22] in the yarn UI, I see all the queues duplicated [11:09:30] for the DEFAULT and GPU node partitions [11:09:36] wow [11:09:39] weird :S [11:09:49] and fifo in GPU seems to have no resources [11:16:32] it seems that there are more gotchas, will follow up after lunch [11:16:58] it is also unfortunate that we cannot specify multiple labels for nodes [11:17:25] at least IIUC [11:22:21] * elukey lunch [13:28:18] back to the capacity scheduler :) [13:28:33] heya elukey [13:28:52] so adding a label == creating a partition on cluster nodes, and all the queues defined are created in there too [13:28:53] elukey: can you please launch a spark-shell from the analytics-search user? [13:29:32] elukey: plan is to try to move the app between queues from a non-admin user [13:29:44] elukey: and see it fail [13:30:18] joal: this is something that we never tried, I have to create a specific kerberos keytab for hadoop test [13:30:50] Arf [13:30:52] hm [13:30:58] I'll do it in a bit, I am trying to understand why I cannot run a job in the fifo queue now [13:31:16] :( [13:31:44] the new labels add an extra complexity layer [13:31:51] I can imagine [13:32:24] elukey: I'll give you 100% compute power in 1/2h (1-1 now) [13:40:00] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10EYener) Hi, just pinging back on this ticket. @mforns are the new gerrit review requests looking better? [13:46:28] 10Quarry: Queries left in "running" state for hours - https://phabricator.wikimedia.org/T278544 (10GeoffreyT2000) [13:48:50] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10GeoffreyT2000) >>!
In T264254#6948098, @CommanderWaterford wrote: > https://quarry.wmflabs.org/query/53621 - no results displayed after running almost fo... [13:57:00] hey teammm [13:57:50] hola! [14:14:55] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10mforns) Hi @EYener, sorry for the delay, I've been a couple days off. Looking now [14:27:59] a-team wanna do some geoguessr in half an hour? someone else could drive today :) [14:28:19] :] +1 [14:29:39] Yes! I'll be late though [14:38:03] (03CR) 10Mforns: "This looks way better! Thanks @EYener." (033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/673164 (owner: 10Erin Yener) [14:43:35] (03CR) 10Mforns: "Looks great! Thanks." (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/673161 (owner: 10Erin Yener) [14:44:53] (03CR) 10Mholloway: "Thanks. The remaining long sentences look fine to me. Just a couple more small things and then I think this will be ready to go." (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 (owner: 10Sharvaniharan) [14:47:38] lexnasser: you're more than welcome to join, I added you to the calendar invite :) [14:48:35] (03CR) 10Mforns: "Looks great!" (032 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/673157 (owner: 10Erin Yener) [14:50:45] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10Isaac) [14:51:11] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10Isaac) (apologies to anyone who got added to this task because I accidentally selected `Analytics` as a subscriber not a tag) [14:51:17] 10Analytics, 10FR-Tech-Analytics, 10Fundraising-Backlog: Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage - https://phabricator.wikimedia.org/T273246 (10mforns) @EYener, I reviewed the changes. They look a lot better, thanks. The only changes needed are very minor. Please, have... [16:13:12] 10Analytics, 10Analytics-Kanban: Memory errors in Spark - https://phabricator.wikimedia.org/T278441 (10Isaac) Quick update -- my dream of being able to simultaneously process the wikitext of every current version of a Wikipedia article in a single job still doesn't work (though they now at least fail due to ac... [16:16:50] joal: so for some reason, the config for the queue in the GPU partition doesn't get any resources, and all my attempts to start a spark-shell get stuck in the ACCEPTED state [16:17:00] tried a lot of confs but I have no idea what's wrong [16:25:19] Hey a-team, I'm curious about the status of T277609 (https://phabricator.wikimedia.org/T277609). I'm seeking a data export from an HDFS table used for storing ORES predictions. Does anyone know if the a-team Hadoop instance is accessible from Toolforge/Cloud VPS? [16:25:20] T277609: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 [16:28:00] suriname0: Hi! It is not; we require special access to it due to sensitive data - https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? [16:30:35] Thanks! Looks like a formal collaboration with the WMF research team is required to get this access myself. Assuming that's not a quick process, do you have a rec for moving forward with this ticket?
Or would it literally be faster to make a formal collaboration request and get that approved than to get an export of the needed (non-sensitive) data? [16:32:13] In fact, I now see that page explicitly says "If you have a one-time need for data, request the data from the Analytics team instead." Is there a general timeline for data requests? It's been more than 10 business days, and trying to determine if I should make a contingency plan.... [16:35:01] suriname0: I think that we don't have any idea about what is the plan beside what Aaron wrote into the task, we have limited resources and we considered this something like "it would be great if", not "this is blocking people" [16:35:36] Ah, gotcha. Well, this is blocking development on a toolforge tool, but I can't imagine that's very high priority. [16:35:48] I can see if Aaron can be more explicit about the location of the HDFS table? [16:36:20] suriname0: yeah and also please add the use case that you mentioned (with more details), so we can get a sense of what needs to be done and how to prioritize it [16:36:53] Got it. I can add the explicit technical detail and link to the relevant research page from the phab ticket. [16:37:24] super thanks, and if Aaron could add more details about what is the dataset dump needed it would be great [16:38:27] IIUC the dataset contains non-PII data (so only ORES predictions), but before making it public somebody on our side will have to vet the data etc.. [16:40:54] Yup. We are looking for a super simple export: (request/prediction_timestamp, rev_id, ores_damaging_prediction, ores_goodfaith_prediction). The problem is that it's incredibly challenging to reconstruct this data from previous versions of the ORES model. Anyway, will write this up and add it to the ticket. [16:42:12] perfect, with those details we might be able to work on it sooner :) [16:47:26] (03PS18) 10Sharvaniharan: Image recommendations table for android [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 [16:48:03] (03CR) 10jerkins-bot: [V: 04-1] Image recommendations table for android [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 (owner: 10Sharvaniharan) [16:50:49] (03PS19) 10Sharvaniharan: Image recommendations table for android [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 [16:59:43] (03CR) 10Sharvaniharan: Image recommendations table for android (032 comments) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/668244 (owner: 10Sharvaniharan) [17:14:50] 10Analytics, 10Patch-For-Review: Load Avro schemas from configurable external path - https://phabricator.wikimedia.org/T126501 (10Jdforrester-WMF) Is this task still valid? Found as part of {T278569}. [17:14:59] 10Analytics-Radar, 10Discovery-Search, 10MediaWiki-General, 10MW-1.36-notes (1.36.0-wmf.37; 2021-03-30): Proposal: drop avro dependency from mediawiki - https://phabricator.wikimedia.org/T265967 (10Jdforrester-WMF) Follow-up: {T278569}. [17:20:34] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10JAllemandou) Hi @Isaac - Thanks for raising this. The problem comes from some projects having both the entire wiki in a single xml file, and split by pages. Our import is not resilient... 
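To make the duplication in T278551 easy to spot, a per-wiki comparison of total rows versus distinct revisions would flag the projects that were imported twice. A sketch only, assuming the data sits in wmf.mediawiki_wikitext_current with snapshot, wiki_db and revision_id columns (table and column names are assumptions, not taken from the task):

```sql
-- Sketch: wikis whose 2021-02 snapshot holds more rows than distinct revisions,
-- i.e. candidates for the double-import problem described in T278551.
SELECT wiki_db,
       COUNT(*) AS row_count,
       COUNT(DISTINCT revision_id) AS distinct_revisions
FROM wmf.mediawiki_wikitext_current
WHERE snapshot = '2021-02'
GROUP BY wiki_db
HAVING COUNT(*) > COUNT(DISTINCT revision_id)
ORDER BY row_count DESC;
```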
[17:20:41] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10JAllemandou) a:03JAllemandou [17:49:59] 10Analytics: Duplicate wikitext entries for a bunch of wikis in 2021-02 snapshot - https://phabricator.wikimedia.org/T278551 (10Isaac) > The problem comes from some projects having both the entire wiki in a single xml file, and split by pages. Ahh...that makes a lot more sense and explains why it seemed to be co... [17:57:02] suriname0: you might be able to get part of what you want from the ORES database tables (https://www.mediawiki.org/wiki/Extension:ORES/Schema). They should be accessible via Toolforge -- e.g., example query for extracting info about ORES articlequality predictions: https://quarry.wmflabs.org/query/52350 I have no idea how complete they are though [18:05:15] ty isaacj, I'll check with Aaron about how up-to-date that table is and whether it is still updated [18:07:56] aqs101[0-5] now have mostly accurate mirrors of all of the "small" tables (i.e. not local_group_default_T_pageviews_per_article_flat and local_group_default_T_mediarequest_per_file - figured they'd be a bit big for a Friday) [18:08:08] hnowlan: nice! [18:08:47] the aqs service itself is busted because of role issues which will get fixed Monday, and then it should be ready for some functional testing; probably try importing those massive tables with some advance warning for traffic too [18:09:14] +1 [18:09:38] hnowlan: how did you import the tables in the end? [18:09:46] (I am curious about the command and procedure etc.) [18:09:50] (never done it) [18:12:05] (03CR) 10Ottomata: [C: 03+1] "+1, although we'd usually do this as a minor version increment, not a bugfix/patch increment." [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/675105 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [18:13:02] 10Analytics, 10Patch-For-Review: Load Avro schemas from configurable external path - https://phabricator.wikimedia.org/T126501 (10Ottomata) 05Open→03Invalid Nope! Closing, thank you. [18:16:01] elukey: transfer.py to ship the data and sstableloader on the files themselves - I did a 1:1 copy from old hosts to new hosts so the rack ordering is the same etc. I didn't do everything at the same time though, so there'll be inaccuracies between tables if hourly imports kicked in [18:16:27] 10Quarry: Quarry queries forever stuck in queue - https://phabricator.wikimedia.org/T274071 (10Bdijkstra) Same symptom also reported as [[ https://phabricator.wikimedia.org/T278544 | T278544 ]]. [18:18:41] hnowlan: ah nice, neat! [18:18:42] as I understand it (and I could be wrong) given that we're using NetworkTopologyStrategy there'll be one replica in each rack - but because we have 4 nodes per rack we'll need to do a 1:1 copy to ensure that we get all replicas [18:18:52] I hope I'm wrong on that though [18:22:15] joal: found the issue!! it wooooorrrkkkkkkksssssssssss!! [18:23:08] hnowlan: so copying all replicas, say in rack1, to the new cluster (4 in total), and then letting Cassandra replicate? [18:24:04] 10Analytics, 10Analytics-SWAP: Users should be able to read their jupyter instance logs - https://phabricator.wikimedia.org/T198764 (10Ottomata) @elukey @razzi @MoritzMuehlenhoff, do you think we could add all analytics-privatedata-users to the systemd-journal group? https://serverfault.com/questions/681632/s...
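For the ORES suggestion at 17:57: the linked Quarry example targets articlequality, but the same pattern should cover the damaging/goodfaith scores asked for in T277609, assuming the Extension:ORES schema documented at the page isaacj links (ores_classification joined to ores_model). A sketch, not a verified export query; how far back those tables go is exactly the open question:

```sql
-- Sketch: current damaging/goodfaith probabilities per revision from the
-- Extension:ORES tables on the wiki replicas (column names per the documented schema;
-- note there can be one row per predicted class for a given revision and model).
SELECT oresc_rev AS rev_id,
       oresm_name AS model,
       oresc_probability AS probability
FROM ores_classification
JOIN ores_model ON oresc_model = oresm_id
WHERE oresm_name IN ('damaging', 'goodfaith')
  AND oresm_is_current = 1
LIMIT 100;
```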
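The import hnowlan describes at 18:16 boils down to copying each table's SSTable files across (transfer.py is the in-house file shipping helper mentioned above) and then streaming them into the new ring with sstableloader, which ships with Cassandra. A sketch with placeholder host and paths, not the exact commands that were run:

```
# Sketch only: stream a staged SSTable directory into the new cluster.
# sstableloader infers keyspace and table from the last two path components,
# and redistributes the keys across whatever ring the target host belongs to.
# TARGET_HOST and the staging path are placeholders.
sstableloader -d "$TARGET_HOST" /srv/staging/my_keyspace/my_table/
```

That redistribution is also why, as discussed just below, a straight copy of a single rack's replicas would not be enough on its own.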
[18:24:09] 10Analytics, 10Analytics-SWAP: Users should be able to read their jupyter instance logs - https://phabricator.wikimedia.org/T198764 (10Ottomata) a:03Ottomata [18:25:04] elukey: all replicas from all racks unfortunately - sstableloader redistributes data on loading across the new cluster once it discovers the ring, so we essentially get the keys redistributed. That won't cause us any practical issues post-import (I think) [18:25:43] hnowlan: ah snap it is unfortunate [18:26:32] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10Aklapper) @CommanderWaterford: Hi, this task is about preparing for multi-instance replicas. For separate / other problems [please report separate issues... [18:26:52] I might get Eric to check my work though, it'd be great if my assumptions are incorrect :D [18:27:27] hnowlan: +1 yes! [18:27:46] all right going to log off for the weekend, thanks a lot for the work hnowlan! [18:27:52] have a good weekend folks! [18:28:02] gonna do the same, have a good one! [18:28:53] 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, 10Goal, and 3 others: Modern Event Platform - https://phabricator.wikimedia.org/T185233 (10Ottomata) [18:44:32] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10CommanderWaterford) I do know very well and since there is no more stability since you are trying to prepare Quarry for this... Quarry is not running now... [18:52:42] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10Aklapper) @CommanderWaterford: If you know very well then please do not intentionally ignore. Also see https://www.mediawiki.org/wiki/Bug_management/Phab... [19:04:44] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10Bstorm) Shoot, I suspect I might have caused that. Fixing. [19:05:15] 10Quarry: Queries left in "running" state for hours - https://phabricator.wikimedia.org/T278544 (10Bstorm) a:03Bstorm [19:17:09] (03PS1) 10Bstorm: connection handling: correct closing of connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/675203 (https://phabricator.wikimedia.org/T278544) [19:18:33] 10Quarry, 10Patch-For-Review: Queries left in "running" state for hours - https://phabricator.wikimedia.org/T278544 (10Bstorm) Basically, I added a cleanup of the connection attribute the other day without some important safeguards, and I think this is killing workers. ` Mar 26 18:06:46 quarry-worker-02 celer... [19:24:38] (03PS7) 10Ottomata: Add support for finding RefineTarget inputs from Hive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451) [19:25:10] (03CR) 10Bstorm: "Merging to stop the bleeding, since right now this is crashing from closing already-closed connections." 
[analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/675203 (https://phabricator.wikimedia.org/T278544) (owner: 10Bstorm) [19:25:17] (03CR) 10Bstorm: [C: 03+2] connection handling: correct closing of connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/675203 (https://phabricator.wikimedia.org/T278544) (owner: 10Bstorm) [19:25:47] (03Merged) 10jenkins-bot: connection handling: correct closing of connections [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/675203 (https://phabricator.wikimedia.org/T278544) (owner: 10Bstorm) [19:29:54] (03CR) 10jerkins-bot: [V: 04-1] Add support for finding RefineTarget inputs from Hive [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/673604 (https://phabricator.wikimedia.org/T212451) (owner: 10Ottomata) [19:30:21] 10Quarry, 10Patch-For-Review: Queries left in "running" state for hours - https://phabricator.wikimedia.org/T278544 (10Bstorm) https://quarry.wmflabs.org/query/53634 completed, and I don't see any workers killed this time. [19:37:46] (03CR) 10Ottomata: Improve Refine failure report email (032 comments) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/674304 (owner: 10Ottomata) [19:37:49] (03PS7) 10Ottomata: Improve Refine failure report email [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/674304 [19:39:20] 10Quarry, 10Patch-For-Review: Queries left in "running" state for hours - https://phabricator.wikimedia.org/T278544 (10Bstorm) 05Open→03Resolved I think I stopped the workers from dying. The code should do a better job of cleaning up connections without trying to close already-closed connections now. Plea... [19:40:50] 10Quarry: Quarry queries forever stuck in queue - https://phabricator.wikimedia.org/T274071 (10Bstorm) It's not really a queue per-se here. The worker died if it is trying to run more than 4 hours. At 4 hours the database server will kill the process (which a worker can detect). We need the web to detect worker... [19:51:06] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) p:05Triage→03Medium [19:52:13] 10Quarry: Quarry queries forever stuck in queue - https://phabricator.wikimedia.org/T274071 (10Bstorm) [19:52:16] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:53:52] 10Quarry: Update status when internal error occurs in worker - https://phabricator.wikimedia.org/T209000 (10Bstorm) [19:54:09] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:54:13] 10Quarry: Gigantic query results cause a SIGKILL and the query status do not update - https://phabricator.wikimedia.org/T172086 (10Bstorm) [19:54:16] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:55:59] 10Quarry: Explain command forces Quarry to keep running endlessly - https://phabricator.wikimedia.org/T155808 (10Bstorm) Explain is almost certainly broken by {T264254} The functionality will need rewriting or similar. It stopped working reliably long before that though because of load balancing on the replicas. 
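The Quarry change merged above is described as making cleanup stop closing already-closed connections. Not the actual Quarry code, but a minimal sketch of the guarded-cleanup pattern Bstorm describes, with illustrative names:

```python
# Minimal sketch (illustrative names, not Quarry's code): close the connection only
# if one is still held, tolerate a connection the server has already killed (e.g. at
# the 4-hour query limit), and always drop the reference so a second cleanup pass
# becomes a no-op instead of crashing the worker.
def close_connection(worker):
    conn = getattr(worker, "connection", None)
    if conn is None:
        return
    try:
        conn.close()
    except Exception:
        # the replica may have dropped the connection already; a failed close
        # should not take down the celery worker
        pass
    finally:
        worker.connection = None
```

The important bit relative to the bug description is the final assignment to None, so later cleanup calls have nothing left to close.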
[19:56:21] 10Quarry: Wrong status of queries in Recent Queries list - https://phabricator.wikimedia.org/T137517 (10Bstorm) [19:56:24] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:56:56] 10Quarry: Quarry task running for a while - https://phabricator.wikimedia.org/T133738 (10Bstorm) [19:57:18] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:57:55] 10Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779 (10Bstorm) [19:58:19] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [19:58:41] 10Quarry: Quarry cannot store results with identical column names - https://phabricator.wikimedia.org/T170464 (10Bstorm) [19:59:17] 10Quarry, 10cloud-services-team (Kanban): Quarry should detect a dead worker and report something better than "running" forever - https://phabricator.wikimedia.org/T278583 (10Bstorm) [20:00:46] 10Quarry: "Query Killed" time is ridiculously low - https://phabricator.wikimedia.org/T71071 (10Bstorm) 05Open→03Invalid This is rather ancient and could not possibly be true now. [20:19:45] 10Quarry, 10Patch-For-Review: EXPLAIN is broken because new analytics wiki replica cluster contains multiple servers - https://phabricator.wikimedia.org/T205214 (10Bstorm) It is more broken now due to {T264254} [22:43:09] 10Analytics, 10Data-Services, 10Machine-Learning-Team, 10ORES, and 2 others: Generate dump of scored-revisions from 2018-2020 for English Wikipedia - https://phabricator.wikimedia.org/T277609 (10Suriname0) Thanks to @elukey for chatting with me in the IRC and prompting me to provide additional info. @Half... [22:44:46] 10Quarry, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare Quarry for multiinstance wiki replicas - https://phabricator.wikimedia.org/T264254 (10Bstorm) Yup. A bug that I surfaced yesterday by improving cleanup code caused workers to die under some conditions (besides what they always have had as...