[00:14:55] Quarry: Add a possibility to delete a draft - https://phabricator.wikimedia.org/T135908 (Krinkle) The drafts can already be renamed, and their description and query can already be overwritten and resubmitted to effectively blank them permanently. As such, the ability to delete a draft wouldn't lose anything...
[00:25:23] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) p:Triage>Unbreak!
[00:53:18] so team I opened ---^
[00:53:29] I still haven't figured out what's happening yet
[00:53:32] will restart tomorrow
[02:16:58] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:47:08] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1005 is OK: OK
[03:40:18] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 25.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[04:22:47] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[05:22:17] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 22.22% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[06:04:28] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[08:00:27] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 23.08% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[08:02:47] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[08:15:39] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) Current state is: * all processors down except the 11th (to work only on one) * logging debug enabled for eventlogging, kafka-python and confluent-kafka-python (manually hacking eventlogging's py...
[08:24:11] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) Spot checking with kafkacat on various topics (eventlogging_$something) I can see that the last timestamp registered is `"2018-07-28T15:44:54Z"` (exactly when the mess happened). More interestingl...
[08:33:31] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) So now I am wondering: if the client-side-events consumer, that should use kafka-python to consume, is not stuck but the processors are, does this mean that the issue is on the confluent-kafka-pyt...
[08:34:46] Quarry: Add a possibility to delete a draft - https://phabricator.wikimedia.org/T135908 (Dvorapa) But who wants to have a list full of empty drafts ([[https://quarry.wmflabs.org/Dvorapa|like me]])
[08:38:50] Quarry: Query runs over 5 hours without being killed - https://phabricator.wikimedia.org/T139162 (Dvorapa) Open>Resolved I have not seen this for a long time
[08:56:03] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) I installed `python-dbg` and then I was able to use the awesome `thread apply all py-bt` function in gdb, that returned: ``` (gdb) thread apply all py-bt Thread 18 (Thread 0x7f406e435700 (LWP 20...
[09:12:24] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) Tested the following, seems not completing: ``` >>> from ua_parser import user_agent_parser >>> ua_string = 'Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111...
[09:31:37] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 25.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[10:37:05] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) A even simpler repro case that me and @Volans found is the following: ``` import re ua_string = 'Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/1111111111111111111111111111111111111111111111111...
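Annotation: the repro in T200630 is truncated in this log, so the following is only a minimal sketch of how the slowdown could be probed, not the original test. It uses the pattern that Volans names as the culprit at 11:29:57, with the log's doubled backslashes un-escaped. The short `AppleWebKit/111...` input, the chosen sizes, and the helper names `time_failed_match` and `timing_table` are assumptions made for illustration.

```python
import re
import time

# Pattern as quoted in T200630 (11:29:57), doubled backslashes un-escaped.
CFNETWORK_RE = re.compile(r"^(.*)/(\d+)\.?(\d+)?.?(\d+)?.?(\d+)? CFNetwork")


def time_failed_match(digit_count):
    """Return (match, seconds) for a UA whose digit run is never followed by ' CFNetwork'."""
    # Hypothetical short stand-in for the very long UA string from the task.
    ua = "AppleWebKit/" + "1" * digit_count
    start = time.perf_counter()
    match = CFNETWORK_RE.match(ua)
    return match, time.perf_counter() - start


def timing_table(sizes=(20, 40, 80)):
    """Time the failed match for each digit-run length in sizes."""
    return [(n,) + time_failed_match(n) for n in sizes]


if __name__ == "__main__":
    # Sizes are kept small on purpose; the failed match has to try many ways
    # of splitting the digit run across the optional groups before giving up.
    for n, match, seconds in timing_table():
        print(f"{n:>4} digits: match={match} in {seconds:.4f}s")
```

The sizes are deliberately far below the roughly 1000-character UA mentioned later in the log, so the script finishes quickly while still letting the per-size timings be compared.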
[11:18:47] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[11:27:48] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[11:29:57] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (Volans) So, after digging a bit, the culprit is the regex that @elukey has pasted above: `^(.*)/(\\d+)\\.?(\\d+)?.?(\\d+)?.?(\\d+)? CFNetwork`, that comes from https://github.com/ua-parser/uap-core/blob/m...
[11:37:58] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (Volans) On a side note, when parsing the very long UA example from @elukey, there is one regex that is pretty slow, and it's https://github.com/ua-parser/uap-core/blob/master/regexes.yaml#L98 ``` ((?:[A-...
[14:23:53] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (Nuria) In order to unnlock we could modify parsing code to do like we do in pageview processing, if ua is over a certain size we just skip it completely https://github.com/wikimedia/analytics-refinery-sou...
[14:44:37] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 88.89% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[15:02:03] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) >>! In T200630#4460147, @Nuria wrote: > In order to unnlock we could modify parsing code to do like we do in pageview processing, if ua is over a certain size we just skip it completely https://gi...
[15:08:31] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (Nuria) Length of UA at fault is 1000 chars and regular UAs are about 200 chars, how about not running through UA parser anything bigger than 4 times the regular limit that is 800 chars?
[15:39:07] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[15:54:48] Analytics-Kanban: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 (elukey) >>! In T200630#4460196, @Nuria wrote: > Length of UA at fault is 1000 chars and regular UAs are about 200 chars, how about not running through UA parser anything bigger than 4 times the regular...
[16:04:33] (PS2) Sahil505: Hide overlay when the cursor is no longer on top of graph [analytics/wikistats2] - https://gerrit.wikimedia.org/r/448852 (https://phabricator.wikimedia.org/T192416)
[16:09:13] (CR) Sahil505: [C: 1] "`d3.event.stopPropagation()` was preventing the overlay to hide when the cursor was no longer on top of the graph." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/448852 (https://phabricator.wikimedia.org/T192416) (owner: Sahil505)
[16:30:25] (PS1) Sahil505: Made Bar-chart popup dynamic [analytics/wikistats2] - https://gerrit.wikimedia.org/r/449029 (https://phabricator.wikimedia.org/T192416)
[17:12:09] (CR) Sahil505: [C: -1] "WIP" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/449029 (https://phabricator.wikimedia.org/T192416) (owner: Sahil505)
[17:23:28] PROBLEM - EventLogging overall insertion rate from MySQL consumer on graphite1001 is CRITICAL: CRITICAL: 22.22% of data under the critical threshold [10.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[18:16:57] RECOVERY - EventLogging overall insertion rate from MySQL consumer on graphite1001 is OK: OK: Less than 20.00% under the threshold [50.0] https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=12&fullscreen&orgId=1
[21:04:31] (PS2) Sahil505: Made Bar-chart popup dynamic [analytics/wikistats2] - https://gerrit.wikimedia.org/r/449029 (https://phabricator.wikimedia.org/T192416)
[21:05:47] (CR) Sahil505: [C: -1] "WIP" [analytics/wikistats2] - https://gerrit.wikimedia.org/r/449029 (https://phabricator.wikimedia.org/T192416) (owner: Sahil505)
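Annotation: the length guard floated in T200630 at 15:08:31 (do not run anything longer than 800 chars through the UA parser) could look roughly like the sketch below. This is an illustration under stated assumptions, not the patch that was eventually merged: `parse_ua_guarded`, `MAX_UA_LENGTH` and `fake_parse` are hypothetical names, and the stub parser only stands in for a real one such as `user_agent_parser.Parse` from `ua_parser`.

```python
# Threshold taken from the discussion: 4 x a typical ~200-char UA.
MAX_UA_LENGTH = 800


def parse_ua_guarded(ua_string, parse, max_length=MAX_UA_LENGTH):
    """Call parse(ua_string) only for UAs up to max_length chars; otherwise return None."""
    if ua_string is None or len(ua_string) > max_length:
        # Skip oversized (or missing) UAs so no regex ever sees them.
        return None
    return parse(ua_string)


def fake_parse(ua_string):
    """Hypothetical stand-in for a real parser (e.g. ua_parser's user_agent_parser.Parse)."""
    return {"string": ua_string, "length": len(ua_string)}


if __name__ == "__main__":
    print(parse_ua_guarded("Mozilla/5.0", fake_parse))   # parsed normally
    print(parse_ua_guarded("x" * 1000, fake_parse))      # skipped, prints None
```

Passing the parser in as an argument keeps the guard independent of any particular UA library; callers would need to treat a `None` result as "UA not parsed".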