[00:55:41] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Milimetric) Would 429s get through to webrequest? I know at some point restbase was sending the 429s in which case, yes, but I forget if the throttling got moved to varnish in which case it...
[01:30:18] (CR) Milimetric: "I have a new way idea about code clarity, basically a strategy for code comments. I blabbered inline but maybe we should just chat about " (5 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/497604 (https://phabricator.wikimedia.org/T218463) (owner: Joal)
[01:31:54] (CR) Milimetric: [C: +2] Correct names in mediawiki-history sql package [analytics/refinery/source] - https://gerrit.wikimedia.org/r/498861 (owner: Joal)
[01:36:09] (CR) Milimetric: "minor thought on syntax" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/499914 (owner: Joal)
[03:37:42] (CR) Nuria: Update mw user-history timestamps (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/497604 (https://phabricator.wikimedia.org/T218463) (owner: Joal)
[05:11:41] Analytics, Analytics-Wikistats, ORES, Scoring-platform-team: Discuss Wikistats integration for ORES - https://phabricator.wikimedia.org/T184479 (Harej) p:Triage→Lowest
[06:32:54] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (elukey) >>! In T219910#5080246, @Milimetric wrote: > Would 429s get through to webrequest? I know at some point restbase was sending the 429s in which case, yes, but I forget if the throttl...
[07:03:46] moorning!
[07:08:41] o/
[07:42:52] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) I think that we should move away from hacks done up to now and...
[07:45:38] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (MoritzMuehlenhoff) >>! In T148843#5080853, @elukey wrote: > * see if i...
[08:38:09] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (elukey) @EBernhardson I think that the most pressing point now is to d...
[08:45:27] Analytics-Kanban, Analytics-Wikistats: Add an option to export the current graph into image file - https://phabricator.wikimedia.org/T219969 (Dvorapa)
[08:52:03] Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (fdans) Hi @Tbayer @ovasileva I've done a test load of the dataset that you can now see in turnilo, for January 15 and 16. You should be able to filt...
[09:17:47] Analytics, Analytics-Kanban, Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (elukey) Trying to assess the data loss for webrequest upload. We have refined the hour 22 with a higher threshold, so I executed the following quer...
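For context on the loss-assessment comment above (and the per-minute counts pasted below at 09:26), a query along the following lines reproduces that kind of per-minute breakdown. This is a sketch, not the exact query referenced in the task comment (which is truncated above); the wmf.webrequest table, its dt field and its partition columns are assumed from the standard refinery schema, and the partition values here are illustrative.

```python
# Rough sketch of a per-minute webrequest count for an affected hour, run with
# pyspark on an analytics client. Table and field names (wmf.webrequest, dt,
# webrequest_source) are assumed from the standard refinery schema; the
# partition values are illustrative, not the exact ones used in the task.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("webrequest-loss-check").getOrCreate()

per_minute = spark.sql("""
    SELECT substr(dt, 12, 5) AS minute, COUNT(*) AS requests
    FROM wmf.webrequest
    WHERE webrequest_source = 'upload'
      AND year = 2019 AND month = 4 AND day = 1 AND hour = 19
    GROUP BY substr(dt, 12, 5)
    ORDER BY minute
""")

for row in per_minute.collect():
    print(row.minute, row.requests)
```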
[09:26:36] Analytics, Analytics-Kanban, Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (elukey) Now for text: ` 19:21 5180607 19:22 5160837 19:23 5214087 19:24 5200381 19:25 5174572 19:26 5226006 19:27 5192289 19:28 4843832 <=======...
[09:37:49] Analytics, Analytics-Kanban, Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (elukey) Checking from what oozie's data loss email report: text 2019-04-01-19 -> 921 requests (0.0% of total) have incomplete records. 86479...
[10:37:00] * elukey lunch!
[11:29:29] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[11:30:36] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Miriam) Thanks @EBernhardson and all!!. Would a CNN finetuning task, u...
[11:32:05] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[12:17:06] Analytics, Product-Analytics, MW-1.33-notes (1.33.0-wmf.21; 2019-03-12), Patch-For-Review: Standardize datetimes/timestamps in the Data Lake - https://phabricator.wikimedia.org/T212529 (JAllemandou) > We prefer the ISO-8601 strings for serialization everywhere. I want to point the **serializati...
[12:32:59] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (JAllemandou) I think the problem experienced yesterday could have been prevented by T189623. Data backup: - Actual number of queries - https://grafana.wikimedia.org/d/000000538/druid?refres...
[14:16:51] (CR) Michael Große: WIP: count number of Wikidata edits by namespace (2 comments) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: Lucas Werkmeister (WMDE))
[14:18:10] (PS1) Michael Große: Track number of links to Wikidata entity namespaces [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903)
[14:19:46] (CR) Michael Große: "I have no idea how to test this locally with the docker setup. But I think it *should* work, assuming that the patch that this is based on" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903) (owner: Michael Große)
[14:30:47] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (EBernhardson) >>! In T148843#5081277, @Miriam wrote: > Thanks @EBernha...
[14:36:00] elukey: around? I've rewritten mjolnir's usage of KafkaRDD and can run it against a small dataset (previous run was ~3M items, this one is 200k). But i figure someone whould be around :)
[14:39:42] ebernhardson: I am yes :)
[14:41:09] alright, starting it up with the 200k item wiki
[14:42:38] ack
[14:43:23] it's only producing now, wont start reading for 10 or 20 minutes, rather cancel for now?
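Circling back to the timestamp-standardization thread above (T212529, 12:17): the two serializations under discussion are MediaWiki's 14-character timestamps and ISO-8601 strings. A minimal, plain-Python illustration of the conversion follows; the refinery implementation itself is Scala, so this is only meant to show the two formats.

```python
# Minimal illustration of the two timestamp serializations discussed in
# T212529: MediaWiki's 14-digit format vs. an ISO-8601 string. Plain Python
# for illustration only, not the refinery-source (Scala) implementation.
from datetime import datetime, timezone

def mw_to_iso8601(mw_ts: str) -> str:
    """Convert e.g. '20190401192100' to '2019-04-01T19:21:00Z'."""
    dt = datetime.strptime(mw_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

def iso8601_to_mw(iso_ts: str) -> str:
    """Convert '2019-04-01T19:21:00Z' back to '20190401192100'."""
    dt = datetime.strptime(iso_ts, "%Y-%m-%dT%H:%M:%SZ")
    return dt.strftime("%Y%m%d%H%M%S")

assert mw_to_iso8601("20190401192100") == "2019-04-01T19:21:00Z"
assert iso8601_to_mw("2019-04-01T19:21:00Z") == "20190401192100"
```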
[14:43:55] oh, ack is acknowledge :) i keep thinking ack like the comics...
[14:44:21] http://ackthbbbt.blogspot.com/p/what-does-ack-thbbbt-mean.html
[14:44:30] yes yes sorry it was "please go ahead I'll watch metrics"
[14:44:31] :D
[14:51:28] PROBLEM - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[14:53:24] Analytics, EventBus, Beta-Cluster-reproducible, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Ability to create blocks broken - https://phabricator.wikimedia.org/T219737 (mobrovac)
[14:54:22] Analytics, Analytics-Kanban, User-Elukey: Change permissions for daily traffic anomaly reports on stat1007 - https://phabricator.wikimedia.org/T219546 (Jdcc-berkman) My memory of setting these perms is long gone, but if I did mark it this way, it was probably because there is private data from the st...
[14:55:26] ouch
[14:55:39] ebernhardson: --^
[14:57:22] hmm, it's not even to the part from before that caused issues, afaict. Right now its the produce stage from production (2 of 3) that is relativly low volume (150 msg/s, ~1.5MB/s)
[14:58:00] ebernhardson: what topic?
[14:58:06] so I can check the brokers
[14:58:07] really weird
[14:58:16] elukey: mjolnir.msearch-prod-response
[14:58:43] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), Wikimedia-Incident: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (mobrovac)
[14:58:56] that stage should finish in the next couple minutes
[14:59:04] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (mobrovac)
[14:59:10] RECOVERY - cache_text: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[14:59:16] wow a ton of partitions
[14:59:30] i guess with this rewrite stages 2 and 3 don't have to be separate, it could read as it produces, with the KafkaRDD i had to know the ending offset before collecting from kafka
[15:00:42] ok so this time I don't see ooms on the brokers
[15:00:49] checking why varnishkafka complained
[15:01:09] it hasn't started reading yet :)
[15:01:10] even https://grafana.wikimedia.org/d/000000027/kafka?refresh=5m&orgId=1&from=now-1h&to=now is ok
[15:01:25] ebernhardson: yeah but did you see the varnishkafka alarm?
[15:02:34] elukey: yes, which is odd because before the problem was either converting records on read, or bandwidth overusage, but neither of those are occuring from the mjolnir code yet. Might suggest some other issue we didn't notice already
[15:03:28] I am wondering if we are saturating the 1G nic
[15:03:31] even with this traffic
[15:03:54] it should start reading from kafka now
[15:06:31] (CR) Ladsgroup: "I can give it a try later."
[analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903) (owner: Michael Große)
[15:07:05] milimetric: w.wiki/_
[15:07:15] wait what
[15:07:38] haha, it normalized it :))
[15:07:44] which means 404
[15:08:06] nvm
[15:08:07] Amir1: I’m not sure what you’re talking about
[15:08:50] milimetric: sorry, according to https://etherpad.wikimedia.org/p/First_URLs I put w.wiki/_ redirecting to https://phabricator.wikimedia.org/T182700
[15:09:03] but the special page normalizes _ character and goes crazy
[15:09:20] but everything else works fine like w.wiki/6
[15:10:25] elukey: reading should be complete. It looks like i have a bug with my end condition, the ending offset needs to be exclusive rather than inclusive (it's stalled waiting for a record with offset > last record)
[15:10:26] Amir1: I’m not sure what you’re talking about... ok, but forgive me if I’m completely spacing out or missed some other conversation, but why are you telling me about this?
[15:11:10] milimetric: because you made the ticket, I put the ticket as easter egg because it's one of funniest tickets in phabricator
[15:12:23] sorry for the confusion
[15:13:36] Amir1: oh! Hahahaha, I just looked at the task, sorry I didn’t get it before
[15:14:03] yeah, that’s not even my joke, I think Andrew came up with it
[15:14:27] ebernhardson: digging into varnishkafka metrics atm, really strange
[15:14:44] TBH my absolute favourite thing on phab is now Dan's poem (w.wiki/6)
[15:14:49] Alex found it
[15:19:31] (PS7) Bmansurov: Oozie: add article recommender [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844)
[15:21:09] (CR) Bmansurov: "Andrew, I've uploaded the artifact. In the interest of time, can we review and merge the patch? I can follow up with more changes once we " [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[15:22:04] Analytics, EventBus, Beta-Cluster-reproducible, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: PHP Warning: Array key should be either a string or an integer - https://phabricator.wikimedia.org/T219738 (mobrovac)
[15:22:37] !log execute kafka preferred-replica-election on kafka-jumbo
[15:22:42] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:25:35] a-team I'll go to SoS, already got the notes
[15:25:47] ah snap I didn't realize that there was an imbalance count (small, ~15 partitions) in kafka-jumbo after the outage
[15:25:51] now it should be good
[15:26:09] vk shows horrible TLS latencies for kafka-jumbo1005
[15:26:19] that was the one causing the timeouts during the last alert
[15:29:59] thanks fdans! I thought Joseph was back
[15:33:57] ebernhardson: I think that there was an imbalance of partitions handled by kafka-jumbo1005, and the extra load caused timeouts.. after a preferred replica election it looks good
[15:34:58] I am running ifstat on kafka-jumbo1005 and I can see a big drop in traffic handled by the NIC (spreading among other brokers)
[15:35:10] I really suspect that 1G interfaces for jumbo are not enough
[15:36:35] the OOMs were of course a problem
[15:38:23] sounds reasonable, this is a smal lenough cluster could probably just install 10G nic's?
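Regarding the end-condition bug ebernhardson describes at 15:10: Kafka end offsets are exclusive (they point at the next offset to be produced), so the stop condition has to be "position < end" rather than "position <= end". A rough sketch of that loop with kafka-python follows; the broker address is a placeholder and this is not the actual mjolnir code.

```python
# Sketch of the end condition discussed at 15:10: fetch the end offsets up
# front (they are exclusive -- the offset of the *next* message to be written)
# and stop once the consumer's position reaches them. Uses kafka-python; the
# bootstrap server is a placeholder, not the actual mjolnir configuration.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "mjolnir.msearch-prod-response"

consumer = KafkaConsumer(
    bootstrap_servers="kafka-jumbo1001.eqiad.wmnet:9092",  # placeholder
    enable_auto_commit=False,
)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)

# end_offsets() returns the next offset to be produced, so the comparison
# below is "position < end", i.e. the end offset is exclusive.
end_offsets = consumer.end_offsets(partitions)

records = []
while any(consumer.position(tp) < end_offsets[tp] for tp in partitions):
    for batch in consumer.poll(timeout_ms=1000).values():
        records.extend(batch)

consumer.close()
print(len(records), "records read")
```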
[15:39:37] yes I just asked what options we have with dcops :)
[15:45:42] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=now-6h&to=now
[15:45:45] super interesting
[15:46:04] I just renamed some graph titles in the dashboard, since they were misleading
[15:46:25] the first two have always indicated the msgs waiting on the cp hosts to be sent to kafka
[15:46:44] so all those peaks (like 100MB) were a big backlog
[15:47:00] TLS latencies for kafka-jumbo1005 were horrible
[15:47:15] (varnishkafka <-> jumbo I mean)
[15:47:46] but after a preferred-replica-election everything went back into normal values
[15:54:08] mforns: I never attend SoS unfortunetaly now that time changed - It collides with kids time
[15:54:25] oh! sorry for that...
[15:54:34] np mforns :) Thanks fdans :)
[15:55:00] I missed mine anyway while I was on holiday, so I definitely owed this one to someone
[15:55:07] ebernhardson: do you have time to retry later on the same job?
[15:56:09] (CR) Nuria: "Baho, you can execute job and test w/o it being merged. Were you able to test it all the way?" [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[15:58:45] elukey: sure
[16:00:02] ping milimetric
[16:00:07] elukey: i can run it now if you want
[16:01:37] ebernhardson: can you wait ~30mins that I am in a meeting? :D
[16:01:45] just to be more reactive if anything happens
[16:03:02] sure
[16:05:25] thanks :)
[16:05:29] will ping you when ready
[16:15:55] Analytics, Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (Nuria)
[16:16:27] Analytics, Analytics-Kanban: Enable pagecount-ez cron on stats boxes - https://phabricator.wikimedia.org/T220012 (Nuria) a:JAllemandou
[16:26:18] ebernhardson: if you have time, please go ahead
[16:26:38] elukey: started, same 200k dataset
[16:26:48] super
[16:28:21] Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (Jseddon) Just looping back in on this ticket @JAllemandou as to what is needed from us.
[16:33:13] !lof Re-launch a manual mediawiki-history-checker with higher acceptance (man
[16:33:28] !lof Re-launch a manual mediawiki-history-checker with higher acceptance (errors manually vetted)
[16:34:05] s/lof/log :)
[16:35:12] fdans: Can you remind me the email we should send the success email-to (product?)
[16:35:47] hmmmm product--analytics@wikimedia.com
[16:35:52] (confirmed by gmail)
[16:35:57] single dash?
[16:36:05] oh damn, yes
[16:36:33] thanks fdans :)
[16:38:11] elukey: will you have time tomorrow morning for us to spend a minute discussing kafka error?
[16:39:13] joal: sure :)
[16:39:21] I don't have a complete clear picture yet
[16:39:28] still interested
[16:39:40] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&from=1554301482048&to=1554306389102 is very interesting
[16:40:03] varnishkafka is really sensitive now to partitions imbalance among brokers
[16:40:31] So I am not sure if it is due to network bandwith, or kafka needing more brokers, or both
[16:46:39] Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (Nuria) @Jseddon : would you be so kind as to please describe the use case or functionality you are looking for?
[16:51:12] elukey: iam lost as to what role tls plays here, is it just connections being slow and that manifesting in varnishkafka not being able to stablish a tls?
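On the partition-imbalance thread (the preferred-replica-election at 15:22 and "varnishkafka is really sensitive now to partitions imbalance" at 16:40): the imbalance elukey describes shows up as partitions whose current leader is not the first broker in their replica list. A sketch of how one might count that per broker, assuming confluent-kafka is available; the bootstrap host is a placeholder.

```python
# Rough sketch of the imbalance check behind the preferred-replica-election
# above: a partition is "misplaced" when its current leader is not the first
# broker in its replica list (the preferred replica). Assumes confluent-kafka
# is available; the bootstrap host is a placeholder.
from collections import Counter
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092"})
metadata = admin.list_topics(timeout=10)

leaders = Counter()
misplaced = []
for topic_name, topic in metadata.topics.items():
    for pid, part in topic.partitions.items():
        leaders[part.leader] += 1
        if part.replicas and part.leader != part.replicas[0]:
            misplaced.append((topic_name, pid, part.leader, part.replicas[0]))

print("partitions led per broker:", dict(leaders))
print("partitions not on their preferred replica:", len(misplaced))
for topic_name, pid, leader, preferred in misplaced:
    print(f"  {topic_name}-{pid}: leader={leader} preferred={preferred}")
```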
[16:52:44] joal: sucess e-mail is been sent by oozie job , woudl that not be sent automatically once checker thresholds are adjusted and it passes?
[16:53:00] joal: ah i see, you are re-starting job all over!
[16:53:08] yes madam
[16:53:51] elukey: mjolnir should be complete now
[16:53:52] Analytics, Fundraising-Backlog: CentralNoticeImpression refined impressionEventSampleRate is int instead of double - https://phabricator.wikimedia.org/T217109 (DStrine) I'm looping in @Ejegg and @AndyRussG for comment.
[16:53:52] joal: me compredou now
[16:54:17] nuria: email will be sent to product-analytics, and me - I'll ping here when done
[16:54:26] joal: k
[16:55:51] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Nuria) @elukey, i think that is teh global throttling which is ok at that level, we probably just want to restrict per method throttling
[17:01:56] ebernhardson: nice, no impact registered
[17:02:50] Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (Jseddon) The goal is to be able to return to near realtime banner impression data within turnillo to achieve three purposes: * Verification of successful delivery of campaigns * As a diagnostic tool to...
[17:03:58] nuria: TLS latencies are, I think, related to how busy is the kafka host to reply to produce requests.. and the backlog shown by the two graphs (total msg size etc..) is also another indicator I think
[17:05:05] the thing that I am wondering is if we have reached a state in which we have less brokers than what we'd need, and a partition distribution imbalance causes troubles in latency sensitive clients like vk
[17:05:11] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Nuria) @elukey on webrequest I would expect part...
[17:05:49] nuria: I didn't get --^
[17:06:29] why the 1h data loss for eventlogging? I am missing something probably :(
[17:13:43] (CR) Bmansurov: "> Patch Set 7:" [analytics/refinery] - https://gerrit.wikimedia.org/r/496885 (https://phabricator.wikimedia.org/T210844) (owner: Bmansurov)
[17:15:05] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[17:15:24] whattttt
[17:15:39] /o\
[17:17:33] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (JAllemandou) Shall I send a PR updating rate-limiting in restbase for edits-per-page requests to 10? https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L1551
[17:19:59] 2019-04-03 17:11:41,206 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
[17:20:03] java.lang.IllegalStateException: Cannot start writing at txid 2921431615 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6fafd25
[17:20:07] never seen this before
[17:21:08] mehhhh
[17:28:13] elukey: has it sucessfuly restarted?
[17:28:47] yes but now 1002 is active
[17:29:00] I am a bit confused
[17:29:39] So am I!
Didn't the error say 1002 was shutting down?
[17:30:09] yes and I have restarted 1002
[17:30:30] and it took the lead at restart ... double meeh!
[17:34:21] joal: bc?
[17:34:27] something weird is happening
[17:34:32] OMW
[17:34:44] we added 20M files in the past two hours
[17:34:50] hm
[17:34:53] not good
[17:39:50] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[17:41:31] we just killed a huge job
[17:41:36] that was hammering the namenodes
[17:51:55] /mnt/hdfs might require some sort of poking as well, `hdfs dfs -ls /` is back up but ls /mnt/hdfs hangs indefinatly
[17:55:12] ebernhardson: ah yes fuse hdfs doesn't like namenode changes
[17:55:14] will do it in a bit
[17:57:12] restarted namenode on 1001 joal
[17:57:28] checking metrics
[17:57:45] didn't really work
[17:58:27] ok with a bit of brutal force it did
[17:58:43] elukey --verbose -9 ?
[18:02:19] Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (Nuria) Question for FR-development team: where does this pipeline of data come from (eventlogging we hope, but double cheking)
[18:02:37] I think it was an issue with the restart itself, it failed (systemd complained) but then a stop/start worked
[18:02:49] the unit are basically wrappers on init.d scripts
[18:02:56] that are not really great :D
[18:03:22] elukey: ok - grafana not yet happy :(
[18:03:36] Analytics: Move FR banner-impression jobs to events (lambda) - https://phabricator.wikimedia.org/T215636 (DStrine) pinging @Ejegg and @AndyRussG directly
[18:06:00] sorry team, my connection is full of hiccups today...
[18:06:44] elukey: still a lot of files on 1001 (big discrepency) in grafana :(
[18:07:25] joal: yeah, not sure if it is due to being secondary or jmx being stale
[18:07:40] !log mediawiki-history-checker manual rerun successful
[18:07:42] ok elukey
[18:07:43] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:08:13] joal: ok if I failover active to 1001?
[18:08:28] let's try elukey - I hope it'll catch up
[18:10:14] it failed
[18:10:36] it failed ... over?
[18:10:55] number of files droped - looking better
[18:11:06] nono it failed to failover :)
[18:11:11] Ah
[18:11:45] but now 1001 seems to be going through loading of inodes etc..
[18:11:49] looks promising
[18:14:15] Loading 53054995 INodes
[18:14:17] sigh
[18:14:27] :(
[18:14:49] I think that the fsimage that it tries to use is an "old" one
[18:15:07] and of course the GC doesn't like it
[18:17:11] 2019-04-03 18:16:52,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 355 seconds.
[18:17:24] and now it is contacting the journal nodes
[18:17:33] hopefully it should understand that something happened :D
[18:18:12] jmx seems to confirm that something positive happened
[18:18:23] the two namenodes are agreeing on the files
[18:18:26] 32M
[18:18:28] \o/ !
[18:18:58] rebuilding state from fsimage has been tedious, but at least it seems correct :)
[18:23:57] joal: if you agree I'd leave this state for a bit, just to let it stabilize for good
[18:24:10] we could even do the failover tomorrow morning
[18:24:15] shouldn't matter a lot
[18:26:10] works for me elukey
[18:26:32] ok gone for diner then :)
[18:26:55] super :)
[18:30:13] * elukey off to dinner!
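On the "jmx seems to confirm" check at 18:18: the NameNode JMX servlet exposes both the HA state and the total file count, so the two masters can be compared programmatically rather than by eye. A sketch follows; the hostnames, port 50070 and the FSNamesystem bean/attribute names are assumptions based on stock Hadoop 2.x rather than verified WMF settings.

```python
# Sketch of the "jmx seems to confirm" check above: hit the NameNode JMX
# servlet on both masters and compare HA state and total file count. Host
# names, the 50070 HTTP port and the bean/attribute names (FSNamesystem ->
# tag.HAState, FilesTotal) are assumptions based on stock Hadoop 2.x, not a
# verified WMF configuration.
import json
import urllib.request

MASTERS = ["an-master1001.eqiad.wmnet", "an-master1002.eqiad.wmnet"]

def fsnamesystem(host):
    url = f"http://{host}:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["beans"][0]

for host in MASTERS:
    bean = fsnamesystem(host)
    print(f"{host}: state={bean.get('tag.HAState')} files={bean.get('FilesTotal')}")
```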
[18:31:54] ebernhardson: forced a remount of /mnt/hdfs on stat1007
[18:31:56] :)
[18:32:20] thanks!
[18:33:32] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Nuria) @JAllemandou let me verify that throttling at levels specified is actually happening. Will do this later on today.
[18:35:39] (CR) Lucas Werkmeister (WMDE): Track number of links to Wikidata entity namespaces (4 comments) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500966 (https://phabricator.wikimedia.org/T218903) (owner: Michael Große)
[18:39:11] joal: what was the big job that was hammering the namenodes?
[18:39:34] (PS2) Lucas Werkmeister (WMDE): WIP: count number of Wikidata edits by namespace [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901)
[18:39:44] * nuria wonders why it is ALL happening at teh same time
[18:39:49] (CR) Lucas Werkmeister (WMDE): WIP: count number of Wikidata edits by namespace (3 comments) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/500752 (https://phabricator.wikimedia.org/T218901) (owner: Lucas Werkmeister (WMDE))
[18:40:26] elukey: let me verify that eventlogging 1 hr loss to make sure it is right, one sec
[19:07:00] nuria: it was a spark job from a researcher
[19:08:54] joal: i really think we need some kind of quota so this does not happen
[19:09:24] nuria: quotas are one way, we were also thinking with Luca about alarms on HDFS files
[19:09:40] here I am :)
[19:09:58] I was watching metrics and the only out of the ordinary one is journal nodes' heaps
[19:10:04] but probably no big deal
[19:10:39] RPC calls could be a good one to monitor too https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&panelId=57&fullscreen
[19:11:05] right
[19:12:13] joal: there is a pyspark shell from the same user
[19:12:36] elukey: this user has been fairly active in the past 2/3 months
[19:13:26] yeah but I am wondering if he is going to test weird things before reading emails :D
[19:14:27] elukey: I doubt it (without good reasons though) - The past shells have not triggered issues, and the one in fly seems inactive for nowb
[19:14:29] since it was created some mins ago afaics
[19:14:36] okok
[19:15:11] joal: ok to attempt the failover again?
[19:15:18] let's go :)
[19:15:58] Failover to NameNode at an-master1001.eqiad.wmnet/10.64.5.26:8020 successful
[19:16:01] \o/
[19:16:07] Nice elukey :)
[19:16:38] !log failover from namenode on 1002 (currently active after the outage) to 1001 (standby)
[19:16:39] oook - here we are, back to regular mode
[19:16:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:16:45] yeah
[19:19:20] ok if we're all good, gone to sleep then :) See y'all tomorrow
[19:19:25] +1
[19:19:26] o/
[19:25:11] Analytics, ChangeProp, Community-Tech, EventBus, and 6 others: Provide the ability to have time-delayed or time-offset jobs in the job queue - https://phabricator.wikimedia.org/T218812 (daniel) A quick note on process. Mooeypoo wrote: > Yes, I think that there's agreement on that in this ticket....
[19:39:57] joal, I'm testing the edit_hourly job and I'm having some problems with the datasets file...
[19:40:42] I think it fails to recognize ${YEAR} and ${MONTH} properties
[19:41:20] maybe it's because the coordinator does not define them, so when the datasets file is imported, it doesn't find the year and month?
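Backing up to the "alarms on HDFS files" idea from 19:09 (and "RPC calls could be a good one to monitor too"): a crude version of such an alarm only needs to sample FilesTotal from the NameNode JMX endpoint twice and compare the growth rate against a threshold. The threshold, interval, host and bean names below are illustrative assumptions, not tuned values.

```python
# The "alarms on HDFS files" idea floated at 19:09, as a minimal sketch built
# on the same JMX endpoint as the check above: sample FilesTotal twice and
# complain if the growth rate looks like a runaway job. Threshold, interval,
# host and bean names are illustrative assumptions.
import json
import time
import urllib.request

JMX = "http://an-master1001.eqiad.wmnet:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
MAX_NEW_FILES_PER_HOUR = 1_000_000   # illustrative threshold
INTERVAL = 300                        # seconds between the two samples

def files_total():
    with urllib.request.urlopen(JMX, timeout=10) as resp:
        return json.load(resp)["beans"][0]["FilesTotal"]

before = files_total()
time.sleep(INTERVAL)
delta = files_total() - before
per_hour = delta * 3600 / INTERVAL
if per_hour > MAX_NEW_FILES_PER_HOUR:
    print(f"ALERT: ~{per_hour:.0f} new HDFS files/hour (delta {delta} in {INTERVAL}s)")
else:
    print(f"ok: ~{per_hour:.0f} new files/hour")
```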
[19:41:33] Analytics, Product-Analytics, Growth-Team (Current Sprint): Add HelpPanel schema to the EventLogging whitelist - https://phabricator.wikimedia.org/T220033 (nettrom_WMF)
[19:41:46] Analytics, Product-Analytics, Growth-Team (Current Sprint): Add HelpPanel schema to the EventLogging whitelist - https://phabricator.wikimedia.org/T220033 (nettrom_WMF)
[19:41:47] oh, joal you're gone, sorry for that. tmorrow!
[19:44:30] Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (Tbayer) Thanks @fdans - we'll also need at least some of the standard fields from the event capsule, as in the case of previous EventLogging ingestio...
[19:47:50] Analytics, Analytics-Kanban, Product-Analytics: Ingest data from PrefUpdate EventLogging schema into Druid - https://phabricator.wikimedia.org/T218964 (Tbayer)
[19:55:37] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Miriam) OK, I can prepare a task for this, or we can start from someth...
[19:55:49] (PS1) Nettrom: Add HelpPanel to EventLogging whitelist [analytics/refinery] - https://gerrit.wikimedia.org/r/501045 (https://phabricator.wikimedia.org/T220033)
[19:58:44] PROBLEM - eventbus grafana alert on icinga1001 is CRITICAL: CRITICAL: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is alerting: EventBus POST Response Status alert. https://wikitech.wikimedia.org/wiki/EventBus
[20:01:18] RECOVERY - eventbus grafana alert on icinga1001 is OK: OK: EventBus ( https://grafana.wikimedia.org/d/000000201/eventbus ) is not alerting. https://wikitech.wikimedia.org/wiki/EventBus
[20:17:48] Analytics, Core Platform Team, EventBus: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (dduvall)
[20:19:20] Analytics, Operations, Research-management, Patch-For-Review, User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (Nuria) I wonder if we can use fashion mist to benchmark: https://resea...
[20:32:08] Analytics, Core Platform Team, EventBus: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (Jdforrester-WMF) ` 13:25:40 <+Pchelolo> James_F: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/499958/ is an ob...
[20:38:48] Analytics, Core Platform Team, EventBus, Patch-For-Review: RefreshLinksJob::runForTitle: transaction round 'RefreshLinksJob::run' already started on commons - https://phabricator.wikimedia.org/T220037 (Pchelolo) The patch above should fix it. This makes me reprioritize the merging of the JobRun...
[20:56:46] PROBLEM - Hadoop Namenode - Stand By on an-master1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:00:53] joal: we need to kill this job again https://yarn.wikimedia.org/proxy/application_1553764233554_32389
[21:01:00] ottomata: yt?
[21:02:06] nuria joseph and luca have gone for today
[21:02:55] mforns: can you kill this job?
application_1553764233554_32389
[21:03:08] mforns: we need to find someone from ops that can deactive crontab of user
[21:04:35] nuria, looking into killing that
[21:05:57] nuria, application killed
[21:08:15] if we can't kill it before we all go to sleep, maybe we can make it so that it fails before it can do any harm
[21:11:44] milimetric: can you work with ops to disable crontabs?
[21:13:44] hello people
[21:13:49] just seen nuria's email
[21:14:26] :]
[21:16:19] so the job has been killed right?
[21:16:23] yes
[21:16:30] this time was his pyspark shell
[21:16:40] nuria: no worries, if there's anything to follow up on I'm here
[21:16:51] I was suspicious about it beforehand sigh :(
[21:17:08] ok so lemme check the master nodes, it took a bit to recover earlier on
[21:17:12] elukey: can you just disable the user account until they answer the email
[21:17:57] milimetric: there is no quick way to just disable the user afaik, I'd need to cut him off probably from ssh access
[21:18:01] or something similar
[21:18:20] elukey: yeah and kick any current session
[21:18:40] I'm just trying to find you a shortcut so you can go back to sleep
[21:18:51] and pointing out you don't need to show any mercy :)
[21:18:55] nah it is good I was not asleep
[21:20:30] ok restarted the namenode on an-master1002
[21:20:43] earlier on it took the leadership for some reason when coming up
[21:20:47] let's see now what happens
[21:21:12] RECOVERY - Hadoop Namenode - Stand By on an-master1002 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:21:33] still reading all the 50M inodes
[21:21:36] it will take a bit
[21:23:00] I checked crontabs for the user on stat/notebooks, nothing
[21:23:14] I suspect that he is not reading emails
[21:24:37] yeah, what's the shell name? I can try pinging on various chat platforms
[21:25:00] the same as the one in the job
[21:25:44] expiry contact indicates leila
[21:27:05] yeah, no sign of anyone on IRC by something close to that name
[21:31:40] looks that the master recovered
[21:36:00] the user was connected to notebook1003
[21:36:16] but not anymore (I was trying to use 'write' via shell con send him a message)
[21:39:50] milimetric: not sure about the next steps, I can't find a way to block the user in yarn
[21:40:23] elukey: ok, no prob, if he acts up again, I'll ping someone else in ops
[21:40:31] good to know where/how he was accessing the cluster
[21:40:38] we should probably figure out a way to ban someone if we need
[21:40:54] yeah agreed
[21:41:04] in theory removing the username does the trick
[21:41:14] on the an-master1001/2 hosts
[21:42:04] because the mapping between posix and hdfs user/groups happens there
[21:43:14] ok, I'll watch for emails from nagios, and act accordingly, no worries
[21:44:01] mmm no what I wrote is not true, probably if the user is able to fire a yarn job it will not matter
[21:44:43] one thing that I tried to do was to use 'last' on notebook1003 to find his pst, and use "write username pst/X" to ping him
[21:47:53] ottomata, nuria: Is hadoop having issues right now? Any idea why I'm getting this error message?
[21:47:55] Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, requirement failed: ./python/lib/py4j-0.10.7-src.zip not found; cannot run pyspark application in YARN mode.
[21:47:55] java.lang.IllegalArgumentException: requirement failed: ./python/lib/py4j-0.10.7-src.zip not found; cannot run pyspark application in YARN mode.
[21:48:57] bmansurov: it was having issues due to a big job, now it should be fine
[21:49:10] elukey: ok, thanks
[21:49:15] :)
[21:57:23] milimetric: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/501067/
[21:57:26] this should be enough
[21:57:28] merging
[21:58:03] thanks elukey
[22:00:07] running puppet on all nodes to remove access
[22:00:34] elukey: k, sorry i was on meeting but i was about to enlist tim startling to do this very thing
[22:00:45] elukey: super thanks and GOOD NITE!
[22:07:29] command to remove is of course timing out on notebook1003
[22:10:12] Analytics, Analytics-Kanban, Core Platform Team (Modern Event Platform (TEC2)), Core Platform Team Backlog (Watching / External), and 2 others: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 (Nuria) From eventlogging navtiming data (see no...
[22:13:07] done!
[22:14:04] yep just checked, all cleaned up
[22:14:12] all right team going to bed :)
[22:14:15] o/
[22:34:41] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Nuria) So, overall the course of the day in which we get 320,000 requests only 11 get throttled. Correcting my numbers, in the course of two hours (hours 10 and 11) we get 63797 by one iP...
[22:42:04] hi guys! is this the right place to ask about A/B testing or some similar? we've got a link which takes people to a text box (subject line + textarea for content) which has a save button. i'd like to know how many people give up at that stage and close the tab. if i know this, i'll be able to decide whether or not we should change how the process is presented to the readers.
[22:59:34] Sveta: kinda/sorta, this is more about analytics infrastructure. I'd suggest starting by looking at the WikimediaEvents extension, and EventLogging to see what other people are doing. Accessing the EventLogging data once collected requires some level of access to analytics infrastructure, or convincing an analyst to get it for you
[23:04:14] ebernhardson: it would be great to access this data, it is for a wikimedia wiki and not my own. who is the best person to ask about this?
[23:07:58] Sveta: honestly it would probably be significantly easier to get someone else to run the actual data analysis (by providing sql queries). You would require analytics cluster access and an NDA: https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Production_access
[23:09:06] Sveta: what you would primarily need to do is define an eventlogging schema, and setup some javascript to send data to the eventlogging schema.
[23:09:33] Sveta: https://www.mediawiki.org/wiki/Extension:EventLogging/Programming
[23:10:26] i guess https://www.mediawiki.org/wiki/Extension:EventLogging/Guide might be better
[23:10:59] ebernhardson: i wouldn't mind to get someone else to run the analysis for me, but i'm not sure what sql queries will be needed. i can show you which page and which text box i'm speaking of
[23:11:50] Sveta: you have to create a schema (described in the guide). The sql query will be against a table with the same columns as whatever schema you define
[23:12:18] without the schema, the data isn't tracked?
[23:12:31] Sveta: by default nothing is tracked
[23:12:36] ah
[23:12:54] that requires sysop access right?
at the wiki
[23:13:06] Analytics: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Nuria) Please see: https://github.com/wikimedia/restbase/pull/1108
[23:13:23] Analytics, Analytics-Kanban: AQS alerts due to big queries issued to Druid for the edit API - https://phabricator.wikimedia.org/T219910 (Nuria) a:Nuria
[23:13:55] ebernhardson: ^^
[23:14:01] Sveta: kinda/sorta? Anyone can create schemas, they are just wiki pages on meta wiki. Such as https://meta.wikimedia.org/wiki/Schema:Edit
[23:14:28] Sveta: but to track anything you will need to be able to deploy javascript code to the users browsers, which is typically done by submitting code patches to the WikimediaEvents mediawiki extension
[23:15:09] we can just deploy the javascript as a on-by-default gadget on-wiki, no? or it is not enough because it does not have database access?
[23:16:12] Sveta: that should work, although deploying code via wiki is scary to me :) Code-review is a wonderful thing for figuring out how something broke
[23:17:38] Sveta: hmm, actually a gadget might not be able to register the schema, that part might have to be done from the WikimediaEvents (or another) mediawiki extension
[23:17:48] per https://www.mediawiki.org/wiki/Extension:EventLogging/Programming#How_it_works
[23:17:59] "Registration of the schema name and revision ID is needed for the client-side javascript event logging to work. Without it the provided javascript library will not process the events."
[23:21:38] basically while the javascript could still call mw.track( 'event.MySchema', { ... });, it wont do anything until the schema is registered from a mediawiki extension
[23:24:48] probably a `site-requests` task could alternately be filed in phab for the schema to be registered for a particular wiki
[23:28:31] now i'm curious - what happens if you bypass mw.track() and send an event directly to the /beacon/event endpoint with a not yet existing schema name? presumably there is a schema whitelist filter somewhere in the pipeline https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging ?
[23:28:41] (i guess ottomata knows)
[23:37:32] HaeB: hmm, poking lightly at the eventlogging service i don't see anything that asks mediawiki about registered schemas. Best guess is it would process it? interesting
[23:38:03] but there are a number of moving pieces, only otto would know :)
[23:38:45] Ori used it here for events from a non-mediawiki site https://meta.wikimedia.org/wiki/Schema:WikimediaBlogVisit
[23:40:36] no obvious mentions of that schema in mw-config or WikimediaEvents
[23:40:42] so, probably it would process it
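On HaeB's question at 23:28 about bypassing mw.track() and hitting /beacon/event directly: the EventLogging client encodes the whole event capsule as a percent-encoded JSON query string, so a hand-rolled request would look roughly like the sketch below. The capsule fields, the hypothetical schema name and the assumption that the endpoint simply returns 204 and leaves any validation to the downstream pipeline are all unverified here.

```python
# Sketch of what "sending an event directly to the /beacon/event endpoint"
# would look like, based on how the EventLogging client encodes events: the
# capsule is serialized to JSON and percent-encoded as the query string. The
# schema name is hypothetical (not a registered schema) and the exact capsule
# fields and response behaviour are assumptions, not verified against the
# production pipeline.
import json
import urllib.parse
import urllib.request

capsule = {
    "schema": "MyHypotheticalSchema",   # hypothetical, not a registered schema
    "revision": 1,
    "wiki": "metawiki",
    "event": {"action": "abandon", "hasSubject": True, "hasBody": False},
}

url = "https://meta.wikimedia.org/beacon/event?" + urllib.parse.quote(json.dumps(capsule))
with urllib.request.urlopen(url, timeout=10) as resp:
    # The beacon is fire-and-forget; validation/whitelisting happens downstream.
    print(resp.status)
```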