[00:22:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[00:31:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[00:57:56] hm
[01:02:19] !log bouncing main-eqiad -> jumbo-eqiad mirror maker
[01:02:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[01:04:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message produce rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 2.005e+04 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[01:05:11] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad average message consume rate in last 30m on einsteinium is OK: (C)0 le (W)100 le 2.103e+04 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[01:09:51] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is CRITICAL: 5.934e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[01:14:59] Analytics, Services: Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4244806 (Ottomata)
[01:23:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on einsteinium is OK: (C)1e+05 gt (W)1e+04 gt 159 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[01:30:42] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: 2.35e+04 gt 1000 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[01:32:52] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is OK: (C)1000 gt (W)100 gt 0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad
[05:33:23] !log re-run failed webrequest-load upload|misc jobs via Hue
[05:33:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
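A note on the MirrorMaker alerts at the top of this log: they come from Prometheus-backed checks, so the same series can be spot-checked by hand against the Prometheus HTTP API. A minimal sketch, where both the Prometheus base URL and the metric name are assumptions for illustration, not the production check definition:

```python
# Spot-check a MirrorMaker rate series via the Prometheus instant-query API.
import requests

PROM = 'http://prometheus.svc.eqiad.wmnet/ops'    # hypothetical Prometheus base URL
QUERY = 'kafka_mirrormaker_message_produce_rate'  # hypothetical metric name

resp = requests.get(PROM + '/api/v1/query', params={'query': QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()['data']['result']:
    # each result carries its label set plus the latest (timestamp, value) pair
    print(series['metric'], series['value'])
```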
[05:59:18] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4245017 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1005.eqiad.wmnet']...
[06:00:06] I am reimaging druid1005 to debian stretch :)
[06:13:51] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4245042 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1005.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['druid1005.eq...
[06:17:24] (PS5) Joal: Update mediawiki-history stats [analytics/refinery/source] - https://gerrit.wikimedia.org/r/434987 (https://phabricator.wikimedia.org/T192481)
[06:20:03] so I took a look at oozie's info for the webrequest-load failures
[06:20:12] and they seem all related to generate_sequence_statistics
[06:20:19] checking mapred jobs I can see
[06:20:20] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 9.1 GB of 2.1 GB virtual memory used. Killing container.
[06:20:38] have to check all of them though
[06:21:03] the thing that I don't get, if this is true, is why it happens only once in a while
[06:35:54] Analytics, Analytics-EventLogging, Readers-Web-Backlog: Explore an API for logging events sampled by session - https://phabricator.wikimedia.org/T168380#4245057 (Nuria) EL will be simplified and from that moment onwards there should not be any performance issues depending on it: https://phabricator.w...
[06:39:21] druid1005 back in service
[07:04:19] Hi elukey - I have seen the same errors, and followed the same thought process
[07:09:19] Thanks elukey for the druid reimage :)
[07:11:35] joal: morning! how is it going??
[07:12:05] elukey: fine! Kinda tired of having the 2 kids by myself, but that is life :)
[07:12:17] :)
[07:12:37] Today again, then things should be back to normal (except that I suspect Friday will be like today ... But we'll see)
[07:12:47] elukey: How about you?
[07:12:59] let me know if I can help by taking over some things in here :)
[07:13:03] all good!
[07:13:30] Thanks for asking elukey - I'd actually like to be able to help more ... These regular webrequest failures don't sit well with me
[07:14:06] elukey: I think I have it
[07:14:20] ahhaha 1m resolution time
[07:14:24] elukey: https://hue.wikimedia.org/oozie/list_oozie_bundle/0014836-180510140726946-oozie-oozi-B
[07:14:40] --> configuration tab --> oozie_launcher_memory
[07:15:02] elukey: we already experienced that, and I thought it had been fixed :(
[07:17:00] do you think that this is it? I thought it was a container spawned by oozie crossing its limits (so not obeying oozie_launcher_memory but something else)
[07:19:16] elukey: As the Black Eyed Peas would say, `I got a feeling` ... https://www.youtube.com/watch?v=uSD4vsh1zDA
[07:19:40] elukey: all of the other workflows I double-checked have 2048
[07:21:58] ahahhahaha
[07:25:03] elukey: I'm gonna restart the bundle now, letting the oozie_launcher_memory value go back to its default
[07:25:06] and see
[07:26:19] super
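For context on the fix being applied here: as joal confirms just below, oozie_launcher_memory resolution is plain property precedence, so a value passed at submission time silently overrides the coordinator default. An illustrative sketch of the two layers; the paths, values, and exact command shape are examples, not the production configuration:

```
# Coordinator default (illustrative XML):
#   <property>
#     <name>oozie_launcher_memory</name>
#     <value>2048</value>
#   </property>
#
# ...which a submission-time property silently overrides, e.g.:
#   oozie job -config bundle.properties -Doozie_launcher_memory=256 -run
#
# Restarting the bundle without the -D override lets the 2048 default apply,
# which would explain launchers being killed for exceeding their memory limit.
```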
[07:26:35] so just to understand - where is this value set? As a -D option when you start the coordinator?
[07:27:02] !log Restart webrequest-load-bundle with default oozie_launcher_memory value (should be 2048, set by workflows)
[07:27:03] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:27:19] elukey: We set it at 2048 in coordinators by default
[07:27:41] elukey: However if you set it using the -D flag, you override the coord config
[07:28:32] elukey: https://hue.wikimedia.org/oozie/list_oozie_bundle/0027733-180510140726946-oozie-oozi-B
[07:28:35] configuration
[07:28:47] --> no oozie_launcher_memory set
[07:28:53] I hope it'll be better
[07:31:11] ah so it got set by mistake during one of the last oozie coordinator kill/respawns?
[07:31:18] elukey: I think it did
[07:31:21] so now it is no longer set to 256 and it picks up 2048
[07:31:24] that is the default
[07:31:41] elukey: we need to double check, but I expect so
[07:32:27] ack
[07:35:05] joal: I can see two webrequest-load bundles in hue though
[07:35:21] elukey: I'm waiting for the last job of the previous bundle to finish
[07:35:26] ahhhh
[07:35:42] okok I was confused, makes sense
[07:36:10] elukey: trying to prevent rerunning
[07:36:15] :)
[07:36:32] in the meantime, druid1005 seems to have restored its segments cache (no more underreplicated segments)
[07:37:23] elukey: interesting - the UI tells me d1005 is loaded at 8.3%, while the other 2 are at 14.9%
[07:37:28] hm - Interesting!
[07:39:11] the important bit (I think) is that segments are replicated as expected (and none are unavailable) buuut I am pretty sure that druid needs to rebalance a bit after a node wipe before getting to a stable state
[07:39:19] and it takes a bit of time, it is not that fast
[07:39:30] (I am waiting a day between each reimage to be sure)
[07:39:52] elukey: makes sense - It's fast to answer queries - I can live with it being not that fast to rebalance :)
[07:40:53] yep yep! And I think it is done on purpose to avoid a massive storm of data shifting between nodes on big clusters
[07:41:30] I was reading that there are clusters of hundreds of druid nodes out there :D
[07:41:36] :)
[07:41:38] I can't imagine doing a rolling upgrade
[07:41:49] elukey: I think criteo has a massive one
[07:42:12] one thing that I still need to verify is whether they fixed the problem of query failures when one subquery to a historical fails
[07:42:21] I've read an article about it from 2017
[07:42:37] in which they were saying that this is/was a big problem
[07:42:58] elukey: I think it's explicitly written that `druid expects historicals to answer`
[07:43:10] But maybe it has changed
[07:43:41] sure, but with replication > 1 I'd expect that the sub-query is retried at least once before giving up
[07:43:50] Makes sense elukey !!
[07:44:04] or maybe it's tunable with a setting
[07:44:09] I may want this behavior or not
[07:44:25] but it sucks if doing maintenance affects wikistats
[07:44:28] elukey: however this also implies potentially slower response times, with all the related issues
[07:44:31] and there is no way to prevent it
[07:44:51] I agree, this is why I was suggesting a tunable
[07:45:05] elukey: As with HDFS and updating files, if it brings too many problems, don't do it and let the client handle it
[07:45:11] :)
[07:45:23] The famous `Not my problem` way
[07:46:04] elukey: I see d1005 continuing to rebalance - Sounds good :)
[07:46:24] 0.11 is way more stable with zk conns too
[07:46:26] I like it
[07:46:45] * joal likes when elukey likes it
[08:33:19] * elukey goes afk for ~30m!
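The load percentages joal is reading off the UI above can also be polled straight from the Druid coordinator API. A minimal sketch, assuming a coordinator host and the default coordinator port (both hypothetical here):

```python
# Print the per-datasource segment load percentage from the Druid coordinator.
import requests

COORDINATOR = 'http://druid1001.eqiad.wmnet:8081'  # hypothetical coordinator host

# /druid/coordinator/v1/loadstatus returns {datasource: percent_loaded}
status = requests.get(COORDINATOR + '/druid/coordinator/v1/loadstatus',
                      timeout=10).json()
for datasource, pct in sorted(status.items()):
    print('{}: {:.1f}% of configured replicas loaded'.format(datasource, pct))
```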
[09:11:53] joal: hola!
[09:12:03] Hi nuria_
[09:12:05] joal: the pageviews-daily set: https://turnilo.wikimedia.org/#pageviews-daily/
[09:12:35] joal: seems to stop after april 30th this year, am i seeing this right?
[09:12:49] nuria_: We load them monthly
[09:13:20] joal: ah wait we are IN May still!!!
[09:13:32] joal: thank you thank you
[09:13:38] :)
[09:14:10] nuria_: I'm a bit sad, even with a strong fight for MWH stats, I'm gonna end up checking correctness using data
[09:14:36] nuria_: I knew accumulators could be somewhat unstable, but they actually are more unstable than expected
[09:21:25] joal: let's look at that in detail next week
[09:22:35] I'm currently making sure I have a working check to deploy soon, so that at least we can move on
[09:24:24] joal: druid1005 now at ~11%, the other two at 13%
[09:24:35] elukey: you indeed were right :)
[09:25:24] do you think that I can reimage druid1006 now
[09:25:25] ?
[09:25:31] seems feasible and not too invasive
[09:25:38] +1 elukey
[09:26:01] elukey: it'll force d1005 to catch up a bit quicker ;)
[09:28:22] goood!
[09:28:26] doing it now :)
[09:34:43] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4245382 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['druid1006.eqiad.wmnet']...
[09:34:51] started the reimage
[09:35:03] after this both clusters will be on stretch! \o/
[09:35:13] \o/
[09:39:13] I just had an idea about varnishkafka's metrics
[09:40:00] at the moment the way we send metrics to graphite is via a cron that runs logster
[09:40:17] which reads a log file containing json stuff, and publishes metrics
[09:40:35] now, if I write a simple Python script that does exactly that
[09:40:59] with the difference that it reads the json file when polled
[09:41:03] by prometheus
[09:41:17] then we'd have a way to move vk to prometheus, ditching graphite
[09:41:59] I also need to check what trick librdkafka uses to send metrics
[09:49:42] elukey: can you confirm the new druid1005 ECDSA key for me please?
[09:54:28] done in pvt :)
[10:01:28] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4245475 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['druid1006.eqiad.wmnet'] ``` and were **ALL** successful.
[10:04:09] druid1006 back in service!
[10:04:31] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade Druid nodes (1001->1006) to Debian Stretch - https://phabricator.wikimedia.org/T192636#4245480 (elukey)
[10:05:11] Analytics, Analytics-Kanban: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642#4245482 (elukey)
[10:29:27] (CR) Nuria: "Found one typo, other than that looks good." (2 comments) [analytics/refinery] - https://gerrit.wikimedia.org/r/436080 (https://phabricator.wikimedia.org/T195882) (owner: Joal)
[10:31:06] Analytics, Analytics-Kanban: Upgrade Analytics infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T192642#4245516 (elukey)
[10:33:43] Analytics, Analytics-Kanban: Problems with external referrals? - https://phabricator.wikimedia.org/T195880#4245531 (Nuria) From @JAllemandou's e-mail: I can't find change-reasons for this artifact: - We deployed code on the 10th (not the 9th), but for code that doesn't impact referer_class - We did c...
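Returning to joal's accumulator remark earlier in the morning: Spark only guarantees exactly-once accumulator updates inside actions; updates made in transformations can be re-applied when tasks are retried or stages recomputed, which is one reason checking correctness against the data itself is the safer route. A minimal sketch of the pitfall:

```python
# Accumulator updates made in a transformation may be applied more than once if
# a task is retried or a stage is recomputed, so the count is a rough signal,
# not a correctness check.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
seen = sc.accumulator(0)

def tag(x):
    seen.add(1)          # re-executed on task retry -> possible over-count
    return x * 2

rdd = sc.parallelize(range(100)).map(tag)
rdd.count()              # the action is what actually triggers execution
print(seen.value)        # 100 in the happy path, possibly more under retries
```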
[10:36:21] PROBLEM - pivot on thorium is CRITICAL: connect to address 10.64.53.26 and port 9090: Connection refused
[10:37:27] ah!
[10:37:47] my bad, I stopped it but it is still there
[10:37:50] wiping it
[10:46:13] bye bye Pivot
[10:50:06] :)
[10:51:25] !log stopped Pivot on thorium
[10:51:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:51:48] Analytics, Analytics-Kanban: Problems with external referrals? - https://phabricator.wikimedia.org/T195880#4245558 (Nuria) I am not sure if I understand the comments around "correlation" but what is happening around late March is the launch of a new version of Safari that supports our meta tag around re...
[10:52:06] Analytics, Analytics-Kanban: Problems with external referrals? - https://phabricator.wikimedia.org/T195880#4245559 (Nuria)
[10:56:14] so I checked our hw refresh/expansion plan
[10:56:32] we should probably review what to do next month/quarter
[11:56:16] hellooo joal yt?
[12:15:56] Analytics, Discovery-Search, MediaWiki-JobQueue, Services (designing): Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4245748 (mobrovac)
[12:16:11] Analytics, Discovery-Search, EventBus, MediaWiki-JobQueue, Services (designing): Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4244806 (mobrovac)
[12:21:19] Analytics, EventBus, ORES, Scoring-platform-team, and 2 others: Numeric keys in ORES models causing downstream Hive ingestion to fail - https://phabricator.wikimedia.org/T195979#4245755 (mobrovac)
[12:21:22] Analytics, Operations, Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245756 (elukey) p:Triage>Normal
[12:21:27] Analytics, EventBus, MediaWiki-JobQueue, Operations, Services (watching): Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4245769 (Pchelolo) p:Triage>Normal
[12:29:52] Analytics, EventBus, MediaWiki-JobQueue, Operations, and 2 others: Clean up cpjobqueue metrics - https://phabricator.wikimedia.org/T196067#4245785 (mobrovac)
[12:36:23] Analytics, EventBus, Wikimedia-Stream, Services (watching), User-mobrovac: Bikeshed what events should be exposed in public EventStreams API - https://phabricator.wikimedia.org/T149736#4245791 (Pchelolo)
[12:36:26] Analytics, ChangeProp, Collaboration-Team-Triage, Edit-Review-Improvements-ReviewStream, and 4 others: Set up the foundation for the ReviewStream feed - https://phabricator.wikimedia.org/T143743#4245792 (Pchelolo)
[12:36:30] Analytics, Analytics-Kanban, EventBus, ORES, and 5 others: Emit revision-score event to EventBus and expose in EventStreams - https://phabricator.wikimedia.org/T167180#4245789 (Pchelolo) Open>Resolved I think this is done, the stream is exposed. Resolving.
[12:37:04] Analytics, Operations, Traffic: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066#4245793 (elukey) As reference, `prometheus::node_gdnsd` might be an example of how to proceed.
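A minimal sketch of the exporter idea elukey floats above: read varnishkafka's JSON stats file only when Prometheus scrapes, instead of pushing to graphite via a logster cron. The stats path, the port, the metric names, and the exact JSON shape (librdkafka's txmsgs/txerrs counters under a "kafka" key) are all assumptions here, not the eventual implementation:

```python
# Pull-style varnishkafka exporter sketch using prometheus_client.
import json
import time
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

STATS_FILE = '/var/cache/varnishkafka/varnishkafka.stats.json'  # hypothetical path

class VarnishkafkaCollector(object):
    """Reads the stats file only when Prometheus polls us."""
    def collect(self):
        with open(STATS_FILE) as f:
            # assume one JSON object per stats interval, one per line; take latest
            stats = json.loads(f.readlines()[-1])
        kafka = stats.get('kafka', {})
        yield GaugeMetricFamily('varnishkafka_kafka_txmsgs',
                                'Messages transmitted to Kafka',
                                value=kafka.get('txmsgs', 0))
        yield GaugeMetricFamily('varnishkafka_kafka_txerrs',
                                'Kafka transmit errors',
                                value=kafka.get('txerrs', 0))

if __name__ == '__main__':
    REGISTRY.register(VarnishkafkaCollector())
    start_http_server(9132)  # arbitrary example port
    while True:
        time.sleep(60)
```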
[12:55:07] Analytics: Reimage thorium to Debian Stretch - https://phabricator.wikimedia.org/T192641#4245814 (elukey) Thinking out loud :) During the last offsite we were wondering if our websites could be served by more than one host, in order to be tolerant in case of failures. All the things on thorium as far...
[12:59:48] Analytics, ChangeProp, EventBus, RESTBase-API, Services (doing): Re-render wiktionary definitions on user purges - https://phabricator.wikimedia.org/T192157#4129917 (Pchelolo) I finally got to it: https://github.com/wikimedia/change-propagation/pull/273
[13:02:23] Analytics, EventBus, MediaWiki-JobQueue, MW-1.32-release-notes (WMF-deploy-2018-06-05 (1.32.0-wmf.7)), and 2 others: Make JobExecutor debug-log to mwlog - https://phabricator.wikimedia.org/T195858#4245831 (Pchelolo)
[13:11:23] o/
[13:14:01] Analytics: Reimage thorium to Debian Stretch - https://phabricator.wikimedia.org/T192641#4245856 (Ottomata) Hmm, good idea in general! The only issue is: https://analytics.wikimedia.org/datasets/ and also wikistats 1.0. Both need a lot of space.
[13:17:24] Analytics, Discovery-Search, EventBus, MediaWiki-JobQueue, Services (designing): Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4245861 (Pchelolo) @Ottomata a theory: we've enabled compression in #EventBus but AFA...
[13:23:33] ottomata: hellooo
[13:24:01] whenever you have time I'd like to chat with you about the hw refresh
[13:24:47] Analytics, Discovery-Search, EventBus, MediaWiki-JobQueue, Services (designing): Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4245883 (Ottomata) Oo, good thought. I think you might be right. Will investigate an...
[13:25:28] elukey: here now is fine! checking emails, etc...
[13:25:31] or you want hangout?
[13:26:20] as you prefer!
[13:26:43] NO AS YOU PREFER
[13:26:45] ahahahah
[13:26:51] ok then we can do it in here
[13:27:03] so the first nodes are analytics100[1,2]
[13:27:12] warranty expired ~1y ago
[13:27:23] maaan already
[13:28:03] then analytics1003, but we'd need to figure out if we want to implement a sort of failover for it
[13:28:07] (at least for the db)
[13:28:59] then there are all the expansions for Hadoop/Druid
[13:29:10] (also analytics1028->41 are OOW)
[13:29:21] (but I'll keep them as much as possible)
[13:29:36] (err I'd keep them)
[13:29:46] and stat1007
[13:29:59] (research-only box with stat1005's gpu)
[13:30:44] oh ya....
[13:31:41] ok elukey...:)
[13:32:22] I also had a crazy idea about thorium, it is in one of your phab notifications probably, I know you will not like it :D
[13:32:27] buuut I am going to try anyway
[13:33:40] ottomata: let me know when you have a minute
[13:37:11] i saw that, i responded to that one :)
[13:37:18] nuria_: yes! oh let me heck that thing...
[13:37:25] check*
[13:39:07] what's up nuria_ ?
[13:39:36] ottomata: how can I know where errors in the refine process come from?
[13:39:43] ottomata: in the crontab I see
[13:40:11] ottomata: /usr/local/bin/refine_eventlogging_analytics >> /var/log/refinery/refine_eventlogging_analytics.log 2>&1
[13:41:11] ottomata: but that is the process log
[13:41:59] ottomata: and it does not have any parsing errors
[13:42:30] nuria_: sudo -u hdfs yarn logs -applicationId
[13:42:40] but also, the email kinda makes it clear what is happening
[13:42:46] Analytics, DC-Ops: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245913 (elukey)
[13:42:53] Analytics, DC-Ops, Operations, procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245932 (elukey)
[13:43:07] The following 1 of 1 dataset partitions for table `event`.`mediawiki_revision_score` failed refinement:
[13:43:07] ... FAILED: ParseException line 10:98 cannot recognize input near '0' ':' 'double' in column specification
[13:43:17] it is trying to create a column named `0`
[13:44:01] ottomata: ahem..
[13:45:29] see: https://phabricator.wikimedia.org/T195979
[13:45:50] Analytics: Reimage thorium to Debian Stretch - https://phabricator.wikimedia.org/T192641#4245937 (elukey) Good points.. In theory wikistats 1.0 should go away soon right? The datasets are indeed a problem, I'll try to think about a solution :)
[13:46:31] ottomata: ya, saw that, that is why I was asking
[13:46:38] Analytics: Reimage thorium to Debian Stretch - https://phabricator.wikimedia.org/T192641#4245938 (Ottomata) I don't think wikistats 1.0 will ever go away, will it? Erik might stop updating it, but I think it will stay online forever.
[13:53:05] heya teaam
[13:53:12] holaaa
[13:53:23] ottomata: but then to see the data we need to look at the topic, no?
[13:53:28] ottomata: like kafkacat -b kafka1012.eqiad.wmnet:9092 -t codfw.mediawiki.revision-score ?
[13:53:39] nuria_: yes and no
[13:53:41] you could do that
[13:53:45] the data is in hadoop too
[13:53:48] just raw, unrefined
[13:53:50] no hive table
[13:53:56] camus is still importing it
[13:53:58] ottomata: ah
[13:54:01] ottomata: rightttt
[13:54:11] could do
[13:54:12] https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_Raw_Data
[13:54:19] spark probably easier
[13:55:14] ottomata: last thing - can I create procurement tasks for analytics100[12] and stat1007 today/tomorrow? We can discuss the others later on post-standup.. I'll CC you as always so you'll be able to comment :)
[13:57:30] yeah! please do!
[13:57:42] i have no idea what is needed for stat1007
[14:01:44] super :)
[14:01:50] going afk for ~1h, brb!
[14:05:11] Analytics, EventBus, MediaWiki-JobQueue, MediaWiki-extensions-ORES, and 3 others: ORESFetchScoreJob fails quite a lot - https://phabricator.wikimedia.org/T196076#4245990 (Pchelolo) p:Triage>Normal
[14:38:38] Analytics, EventBus: EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077#4246078 (Ottomata)
[14:39:43] Analytics, EventBus, Services (watching): EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077#4246090 (Pchelolo)
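On ottomata's "spark probably easier" pointer earlier: a hedged sketch of peeking at one hour of the raw, unrefined camus import from a pyspark shell (where `spark` is the session). The HDFS path layout here is illustrative, not the exact production layout:

```python
# Read one hour of raw revision-score events and look for the numeric map keys
# ("0": double) that break the generated Hive DDL.
raw = spark.read.json(
    '/wmf/data/raw/event/codfw_mediawiki_revision-score/hourly/2018/06/01/00')
raw.printSchema()          # offending numeric field names would show up here
raw.show(5, truncate=False)
```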
[14:57:58] (CR) Nuria: "This works real well. Nice work." [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[14:59:59] (CR) Nuria: [V: 2 C: 2] Ignore search preferences [analytics/wikistats2] - https://gerrit.wikimedia.org/r/433584 (owner: Milimetric)
[15:02:29] (CR) Nuria: [C: 2] Expand setState to be more explicit [analytics/wikistats2] - https://gerrit.wikimedia.org/r/433585 (owner: Milimetric)
[15:04:54] (CR) Nuria: [C: 2] Move detail state into store [analytics/wikistats2] - https://gerrit.wikimedia.org/r/433586 (owner: Milimetric)
[15:05:41] mforns: milimetric's changes for state are REAL nice cc fdans
[15:06:01] nuria_, yes amazing :]
[15:06:18] killer feature :)
[15:07:18] (Merged) jenkins-bot: Expand setState to be more explicit [analytics/wikistats2] - https://gerrit.wikimedia.org/r/433585 (owner: Milimetric)
[15:08:29] (Merged) jenkins-bot: Move detail state into store [analytics/wikistats2] - https://gerrit.wikimedia.org/r/433586 (owner: Milimetric)
[15:12:44] (CR) VolkerE: [C: -1] "Few minor comments" (6 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[15:14:40] Analytics, Operations, hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246157 (elukey)
[15:16:15] Analytics, Operations, hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246157 (elukey)
[15:25:56] Analytics, Operations, hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4246185 (elukey)
[15:26:06] lzia: o/
[15:26:35] I just opened a task to follow up on the dedicated host that we discussed in our last offsite
[15:36:37] Analytics, Services: Enable TLS and authorization for cross DC MirrorMaker - https://phabricator.wikimedia.org/T196081#4246215 (Ottomata)
[15:44:36] (PS3) Sahil505: Fixed accessibility/markup issues of Wikistats 2.0 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533)
[15:46:11] (CR) Sahil505: "Thanks for the review, Volker :-]" (6 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[15:50:47] (CR) VolkerE: Fixed accessibility/markup issues of Wikistats 2.0 (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[15:53:48] (PS4) Sahil505: Fixed accessibility/markup issues of Wikistats 2.0 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533)
[15:54:14] (CR) VolkerE: [C: 1] Fixed accessibility/markup issues of Wikistats 2.0 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[15:54:33] (CR) Sahil505: Fixed accessibility/markup issues of Wikistats 2.0 (1 comment) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/436002 (https://phabricator.wikimedia.org/T185533) (owner: Sahil505)
[15:57:18] (CR) Nuria: [C: 2] Reflect detail state in the URL and back (2 comments) [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[16:01:04] (PS1) Zhuyifei1999: Change database connection & table charset to 'utf8mb4' [analytics/quarry/web] - https://gerrit.wikimedia.org/r/436576
[16:01:14] ping joal
[16:02:52] (Merged) jenkins-bot: Reflect detail state in the URL and back [analytics/wikistats2] - https://gerrit.wikimedia.org/r/434500 (https://phabricator.wikimedia.org/T179444) (owner: Milimetric)
[16:12:02] (CR) Zhuyifei1999: [C: -1] "query_revision.text is BINARY. needs to dig further." [analytics/quarry/web] - https://gerrit.wikimedia.org/r/436576 (owner: Zhuyifei1999)
[16:30:11] Analytics-Kanban, Patch-For-Review: Refresh SWAP notebook hardware - https://phabricator.wikimedia.org/T183145#4246349 (RobH)
[16:31:46] Analytics, Services: Enable TLS and authorization for cross DC MirrorMaker - https://phabricator.wikimedia.org/T196081#4246358 (Ottomata) p:Triage>Normal
[16:32:04] Analytics, Services: Enable TLS and authorization for cross DC MirrorMaker - https://phabricator.wikimedia.org/T196081#4246362 (Nuria) p:Normal>High
[16:32:58] Analytics, Services: Enable TLS and authorization for cross DC MirrorMaker - https://phabricator.wikimedia.org/T196081#4246215 (Nuria) p:High>Normal
[16:33:53] Analytics, Operations, hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4246388 (Nuria) p:Triage>Normal
[16:34:31] Analytics, Operations, hardware-requests: Site: eqiad | Hardware refresh for analytics100[1,2] - https://phabricator.wikimedia.org/T196079#4246394 (Nuria) p:Triage>Normal
[16:35:35] Analytics, EventBus, Services (watching): EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077#4246399 (Ottomata) p:Triage>Normal
[16:36:13] Analytics, EventBus, Services (watching): EventBus service can drop a few messages during kafka leadership change - https://phabricator.wikimedia.org/T196077#4246402 (Nuria) p:Normal>Low
[16:37:21] Analytics, DC-Ops, Operations, procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4246405 (Nuria) p:Triage>Normal
[16:37:26] Analytics, DC-Ops, Operations, procurement: Analytics hosts missing in Inventory/Refresh - https://phabricator.wikimedia.org/T196072#4245913 (Nuria) p:Normal>Low
[16:37:44] Analytics, Discovery-Search, EventBus, MediaWiki-JobQueue, Services (designing): Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4246414 (Nuria) p:Triage>Normal
[16:38:04] Analytics, Analytics-Kanban, Discovery-Search, EventBus, and 2 others: Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4244806 (Nuria)
[16:38:52] Analytics, Analytics-Kanban: Problems with external referrals? - https://phabricator.wikimedia.org/T195880#4246440 (Nuria) p:Triage>High
[16:39:24] Analytics, EventBus, MediaWiki-JobQueue, MediaWiki-extensions-ORES, and 3 others: ORESFetchScoreJob fails quite a lot - https://phabricator.wikimedia.org/T196076#4246449 (Nuria) p:Normal>Triage
[16:41:50] Analytics, Analytics-Kanban, Patch-For-Review: Deploy Turnilo (possible pivot replacement) - https://phabricator.wikimedia.org/T194427#4246469 (Nuria)
[16:41:57] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Pageviews-daily broken after move from Pivot to Turnilo - https://phabricator.wikimedia.org/T195819#4246468 (Nuria) Open>Resolved
[16:43:19] Analytics, Analytics-Wikistats, Domains, Operations, Traffic: HTTP 500 on stats.wikipedia.org (invalid domain) - https://phabricator.wikimedia.org/T195568#4231062 (Nuria) Option b) sounds good.
[16:46:29] Analytics, Analytics-Kanban, Beta-Cluster-Infrastructure, Puppet: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4093870 (Nuria) a:elukey
[16:47:26] Analytics, Analytics-Kanban, Wikimedia-Stream: Support timestamp based consumption in KafkaSSE and EventStreams - https://phabricator.wikimedia.org/T196009#4246480 (Nuria) p:Low>High
[16:52:46] Analytics, Analytics-Wikistats, Domains, Operations, Traffic: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4246485 (Krinkle)
[16:53:43] Analytics, Analytics-Wikistats, Operations, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#4246500 (Krinkle) declined>Open
[16:54:13] Analytics, Analytics-Wikistats, Operations, Traffic, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (Krinkle)
[16:54:17] Analytics, Analytics-Wikistats, Domains, Operations, Traffic: HTTP 404 on stats.wikipedia.org (Domain not served) - https://phabricator.wikimedia.org/T195568#4231062 (Krinkle)
[16:54:25] Analytics, Analytics-Wikistats, Operations, Traffic, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (Krinkle)
[16:55:23] Analytics, Analytics-Wikistats, Operations, Traffic, Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (Krinkle) >>! In T195568#4233129, @Dzahn wrote: > option a) delete stats record from the wi...
[16:56:56] ottomata: Can you help me delete the MobileWikiAppiOSSessions table again? For some reason the type of the `measure` field is still `double`... Thanks!
[16:57:29] Just to be safe, can you help delete MobileWikiAppiOSUserHistory, MobileWikiAppiOSLoginAction, MobileWikiAppiOSSettingAction, MobileWikiAppiOSReadingLists as well?
[16:58:51] Analytics, Analytics-Cluster, Patch-For-Review: Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#4246528 (Ottomata) Here is a thought. We are currently running EventStreams backed by the analytics-eqiad Kafka cluster, instead of the main-* clusters, so that we coul...
[16:58:59] hm, chelsyx ok
[16:59:25] Thanks ottomata !
[17:01:40] !log dropping and deleting MobileWikiAppiOS* tables and data per request from chelsyx
[17:01:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:02:44] done
[17:03:20] Thx!
[17:07:50] Analytics, Analytics-Kanban, EventBus, MediaWiki-JobQueue, and 2 others: Huge messages in eqiad.mediawiki.job.cirrusSearchElasticaWrite (and other?) topics - https://phabricator.wikimedia.org/T196032#4246567 (EBjune)
[17:21:50] Analytics, Commons, EventBus, MediaWiki-JobQueue, and 4 others: Make gwtoolsetUploadMediafileJob JSON-serializable - https://phabricator.wikimedia.org/T192946#4154783 (Ramsey-WMF)
[18:29:52] * elukey off!
[21:24:35] Analytics, Operations, Patch-For-Review: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#4247353 (Ottomata) Another bump for my friends @akosiaris and @MoritzMuehlenhoff :)
[22:35:46] Analytics, Analytics-Cluster, Patch-For-Review: Move EventStreams to new jumbo cluster. - https://phabricator.wikimedia.org/T185225#4247489 (Pchelolo) We actually have a rate limiter implementation integrated into `service-runner` - it's based on DHT so it's cluster-wide and it's been tested pretty wel...
[23:28:50] Analytics, EventBus, MassMessage, MediaWiki-JobQueue, and 2 others: Global mass message delivered on meta but not on other wikis? - https://phabricator.wikimedia.org/T195500#4247552 (Liuxinyu970226)