[06:51:14] Analytics-Kanban, User-Elukey: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303#3561161 (Marostegui) And here we are again: ``` root@dbstore1002:/srv# df -hT /srv/ Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 6.4T 5.9T 556G 92% /srv ```
[07:45:30] Analytics, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561225 (fgiunchedi) Resolved>Open Reopening, rdkafka metrics for eventstreams have been out of control for a couple of days ``` graphite1001:~$ du -hcs /var/lib/carbon/whisper/eventstreams/rdkaf...
[07:45:50] Analytics, Operations: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561228 (fgiunchedi)
[07:46:02] Analytics, Operations, monitoring: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3106537 (fgiunchedi)
[08:01:51] Analytics, Operations, monitoring, Patch-For-Review: Eventstreams graphite disk usage - https://phabricator.wikimedia.org/T160644#3561239 (elukey) The other step to take would be to limit the amount of data that we store for librdkafka, because with so many clients it is impossible to keep track o...
[08:25:31] Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561268 (elukey) ``` MariaDB [(none)]> show create database log; +----------+----------------------------------------------------------------+ | Database | Create Dat...
[08:37:55] Hi elukey, anything to discuss this morning?
[08:39:07] joal: o/ - I am about to start restarting yarn/hdfs daemons for the jvm updates
[08:39:21] okay, I'll watch oozie
[08:40:38] joal: archiva is also going to be restarted by Moritz as FYI
[08:40:50] elukey: thanks for letting me know
[08:41:12] elukey: I wonder if those restarts don't trigger errors for us, it'll be interesting to check :)
[08:42:04] archiva restarted, let me know if you spot any problems
[08:43:39] moritzm: hi ! Thanks for that, will check
[09:37:55] (PS3) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[09:39:12] nodemanagers almost restarted, after that will proceed with hdfs
[09:39:39] (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[09:39:55] Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561391 (jcrespo) Are you sure you want utf8 and not utf8mb4? You may be losing extended area characters (e.g. emojis).
[09:41:23] Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561394 (Marostegui) >>! In T170952#3561391, @jcrespo wrote: > Are you sure you want utf8 and not utf8mb4? You may be losing extended area characters (e.g. emojis)....
[09:46:51] started with hdfs, with 2 hosts at a time and 120s of sleep between each batch
[09:49:05] Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3561399 (elukey) Yep I made it equal to db1047, but I am open to suggestions :)
[09:52:23] (PS4) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[09:54:16] (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[10:28:54] (PS5) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[10:28:56] (PS6) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[10:28:59] (PS5) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[10:31:15] (CR) jerkins-bot: [V: -1] Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207 (owner: Joal)
[10:31:17] (CR) jerkins-bot: [V: -1] Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101) (owner: Joal)
[10:31:37] (CR) jerkins-bot: [V: -1] [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550) (owner: Joal)
[10:37:27] Analytics, Contributors-Analysis: Bring the Editor Engagement Dashboard back - https://phabricator.wikimedia.org/T166877#3561513 (Halibutt) @Neil_P._Quinn_WMF; @Nuria, any successes?
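Editor's note: the HDFS restart described at 09:46:51 (2 hosts at a time, 120s of sleep between batches) can be sketched roughly as below. This is only an illustration under stated assumptions: the log does not show the tool or command actually used, so the `restart` callback and the host names in the usage note are hypothetical placeholders.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(hosts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    batch: List[str] = []
    for host in hosts:
        batch.append(host)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], object],  # hypothetical: whatever restarts one host
    batch_size: int = 2,               # "2 hosts at a time" from the log
    pause_seconds: float = 120.0,      # "120s of sleep between each batch"
    sleep: Callable[[float], object] = time.sleep,
) -> List[str]:
    """Restart hosts batch by batch, pausing between batches but not after the last."""
    groups = list(batches(hosts, batch_size))
    restarted: List[str] = []
    for index, group in enumerate(groups):
        for host in group:
            restart(host)
            restarted.append(host)
        if index < len(groups) - 1:
            sleep(pause_seconds)
    return restarted
```

A call such as `rolling_restart(["worker-a", "worker-b", "worker-c"], restart=print)` would handle the first two (made-up) hosts, wait two minutes, then handle the third. Passing `sleep` as a parameter keeps the pause easy to stub out in a dry run.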
[10:45:55] joal: all worker nodes restarted, now we'd need to do the usual hive/oozie/etc.. and finally the master nodes
[10:46:08] ok elukey
[10:46:25] I'm still here, then I'll take my usual break
[10:48:46] joal: I think we can do it after the break
[10:48:51] k
[10:53:02] all right going to lunch!
[10:53:29] Bye elukey :)
[10:54:41] helooooo
[10:55:04] Hi mforns :)
[10:55:12] mforns: you missed elukey by not much ;)
[10:55:21] hi joal :]
[10:55:26] I know...
[10:55:37] will ping him later
[11:26:13] (PS6) Joal: Upgrade scala to 2.11.7 and Spark to 2.2.0 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[11:26:23] I am back!
[11:26:58] I am going to restart/failover the master nodes now, should be available in ~30mins
[11:32:23] (PS7) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101)
[11:32:25] (PS6) Joal: [WIP] Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[11:33:22] restarted daemons on an1002
[11:35:53] (CR) Mforns: "LGTM in general! Awesome that you did it with so little code! Some comments inline." (6 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: Joal)
[11:37:53] failed over hdfs to 1002
[11:44:18] aaand done, 1001 back as master
[11:44:40] looks awesomely good elukey :)
[11:49:09] oozie not complaining! \o/
[12:04:21] joal: speaking about oozie, I'd stop jobs and let everything drain to allow a better restart of hive/oozie daemons
[12:04:32] PROBLEM - HDFS missing blocks on analytics1001 is CRITICAL: CRITICAL: 75.86% of data above the critical threshold [5.0]
[12:04:55] oh boy
[12:05:01] (CR) Fdans: "Just a couple comments generally agreeing with mforns and one nit, but it's looking great!" (4 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/373961 (https://phabricator.wikimedia.org/T174174) (owner: Joal)
[12:05:01] this is the new alarm
[12:05:08] sounds good to me elukey - Please pause and resume, don't stop and restart :)
[12:05:13] yep yep
[12:07:30] weird I can only see corrupt blocks
[12:07:31] https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=39&fullscreen&orgId=1&from=now-3h&to=now
[12:08:10] ahhaah the title of the alarms is flipped
[12:08:17] in puppet, going to fix it now
[12:09:25] joal: I'd execute hdfs fsck -list-corruptfileblocks
[12:10:29] mmm not sure if the option is valid
[12:11:27] ah yes seems to be there
[12:13:00] elukey@analytics1001:~$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
[12:13:03] Connecting to namenode via http://analytics1001.eqiad.wmnet:50070
[12:13:05] The filesystem under path '/' has 0 CORRUPT files
[12:14:07] Mwarf :(
[12:16:28] ok fixed the alarms description in puppet
[12:18:49] lemme see what a regular fsck reports
[12:23:05] cluster is healthy, nothing reported
[12:23:25] that's weird that this alarm fired then
[12:24:21] I am re-running it with -openforwrite
[12:24:25] just in case
[12:27:17] all good again
[12:27:20] the metric is Hadoop.NameNode.analytics1001_eqiad_wmnet_9980.Hadoop.NameNode.FSNamesystem.CorruptBlocks.mean
[12:30:03] ah no
[12:30:06] elukey@analytics1001:~$ sudo -u hdfs hdfs dfsadmin -report
[12:30:06] Configured Capacity: 1937161485090816 (1.72 PB)
[12:30:06] Present Capacity: 1915101328101666 (1.70 PB)
[12:30:06] DFS Remaining: 538274591326004 (489.56 TB)
[12:30:06] DFS Used: 1376826736775662 (1.22 PB)
[12:30:08] DFS Used%: 71.89%
[12:30:11] Under replicated blocks: 0
[12:30:13] Blocks with corrupt replicas: 22
[12:30:16] Missing blocks: 0
[12:30:18] Missing blocks (with replication factor 1): 0
[12:32:51] so analytics1055 is down now because of a broken disk, it is maybe the source of the confusion
[12:34:45] for some reason it might have been picked up now after the restart
[12:40:12] so from https://grafana.wikimedia.org/dashboard/db/analytics-hadoop?panelId=39&fullscreen&orgId=1&from=now-3h&to=now it seems that the corrupt blocks reported by HDFS were rising in the sequence used to restart the HDFS master node daemons
[12:40:22] first 1002 then 1001 after minutes
[12:41:04] I am pretty sure this is a weirdness of restart + an1055 down
[12:41:42] RECOVERY - HDFS missing blocks on analytics1001 is OK: OK: Less than 60.00% above the threshold [2.0]
[12:43:01] elukey: if we only get alerts after restarts, that's not really efficient :)
[12:43:14] ah recovery
[12:44:08] no this is only icinga switching names after my puppet patch
[12:44:35] joal: I agree, but it must be a special weird case for dfs report etc..
[12:44:41] or maybe a little bug
[12:46:04] !log suspend oozie jobs from Hue to allow an easier restart of oozie/hive daemons
[12:46:05] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:46:30] (PS7) Joal: Upgrade scala to 2.11.7 and Spark to 2.1.1 [analytics/refinery/source] - https://gerrit.wikimedia.org/r/348207
[12:46:51] PROBLEM - HDFS corrupt blocks on analytics1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0]
[12:50:23] o/ :)
[12:51:54] Taking my break a-team
[12:52:03] o/
[12:55:33] Analytics, Wikimedia-Stream: Stop tracking EventStreams client lag in graphite - https://phabricator.wikimedia.org/T174435#3561871 (Ottomata)
[13:13:42] bearloga: o/ - I'd need to restart hive server/metastore during the next couple of hours for jvm updates, I saw that you are running a job so if you could stop after it I'd be super happy :)
[13:22:31] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481107 (Krenair) Was the maintain-views step not performed? ```MariaDB [hiwikiversity_p]> show tables; Empty set (0.00 sec)```
[13:23:02] elukey: are you restarting whole cluster?
[13:23:05] just curious
[13:23:26] ottomata: o/ - sorry just seen the pages :(
[13:23:39] so yes I've restarted the whole thing
[13:23:42] including masters
[13:23:59] and after restarting the HDFS master daemons corrupt blocks rose
[13:24:07] but fsck doesn't report anything
[13:24:17] meanwhile dfsadmin -report does
[13:24:23] that is super weird
[13:24:36] analytics1055 is down for maintenance due to a broken disk
[13:24:48] so I suspect that it might be related
[13:26:26] hmm, seems unlikely, since it's been down so long, i would expect for blocks on an55 to be replicated elsewhere
[13:26:31] and if they weren't, i wouldn't think they'd be corrupt
[13:26:33] but missing
[13:26:40] buuut, maybe it has something to do with restarts
[13:26:48] some writes get funky during a restart
[13:26:54] and when it comes back online it has to fix them
[13:27:29] elukey: mostly asking so I can resolve T172018 :)
[13:27:29] T172018: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018
[13:28:13] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561980 (Reedy) >>! In T171829#3553943, @Marostegui wrote: > The blocker is fixed and so is this one too: > ``` > mysql:root@localhost [hiwikiversity_p]> show t...
[13:29:57] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561986 (Marostegui) >>! In T168765#3561969, @Krenair wrote: > Was the maintain-views step not completely performed? > ```MariaDB [hiwikiversity_p]> show tables...
[13:30:09] ottomata: ahahah yes you can resolve it
[13:31:02] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3561990 (Marostegui) >>! In T168765#3561986, @Marostegui wrote: >>>! In T168765#3561969, @Krenair wrote: >> Was the maintain-views step not completely performed...
[13:31:46] ottomata: anyhow, I have also checked fsck for open files for write, all reported as healthy
[13:31:53] this is why I was a bit puzzled
[13:32:38] Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3561991 (Ottomata) Done! Looking ok! You should really use `base::service_unit` in puppet to manage your systemd servi...
[13:32:52] aye
[13:33:55] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018#3561992 (Ottomata) @EBernhardson, Luca just restarted the cluster. Can you tell if the change we merged fixes this?
[13:38:25] ottomata: as FYI I restarted kafka1012->1014 too today
[13:39:14] k cool
[13:39:38] elukey: ksql looks really fun
[13:39:43] i'm pretty eager to try it!
[13:40:49] it looks cool indeed!
[13:42:53] ha! they support JSON before Avro?!
[13:42:55] KSQL currently supports formats:
[13:42:56] • DELIMITED (e.g. CSV)
[13:42:56] • JSON
[13:42:56] Support for Apache Avro is expected soon.
[14:07:55] ok oozie restarted on an1003, just need to do hive when bearloga's hive jobs are completed
[14:08:01] for the moment our jobs are stopped
[14:20:11] Analytics-Kanban, User-Elukey: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3562106 (mforns) Hi all! I'm seeing insertions into mariaDB for Popups events. About 3000 every 4 minutes (12 evt per second approx). ``` 2017-08-29 13:54:59,108 [22699] (Mai...
[14:38:04] elukey: I think bearND|afk jobs are run from reportupdater, and given the dates it's currently working, I think it'll not be done for a while
[14:38:31] Analytics-Kanban, User-Elukey: Calculate how much Popups events EL databases can host - https://phabricator.wikimedia.org/T172322#3562155 (elukey) I'd prefer to wait a bit before closing, space consumption went a bit up afaics: {F9205031} It was of course expected but since dbstore1002 is still not out...
[14:38:35] elukey: I think a failed job is rerun automatically in reportupdater, mforns, can you confirm?
[14:40:49] yeah just restarted hive daemons as soon as I saw a new query popping up
[14:42:42] re-enabling camus and then oozie jobs
[14:47:16] Cluster restart completed
[14:52:44] Thanks a lot elukey :)
[14:56:00] Analytics, Analytics-Wikistats, Operations, Wikidata, and 6 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3483663 (chasemp) >>! In T168765#3561990, @Marostegui wrote: >>>! In T168765#3561986, @Marostegui wrote: >>>>! In T168765#3561969, @Krenair wrote: >>> Was the m...
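Editor's note: during the restart above, `hdfs fsck / -list-corruptfileblocks` reported 0 CORRUPT files while `hdfs dfsadmin -report` showed "Blocks with corrupt replicas: 22". When comparing the two by hand it can help to pull just the block counters out of the report text. The following is a minimal sketch that parses text shaped like the output pasted at 12:30; the function name and the idea of scripting this check are assumptions for illustration, not something the channel describes doing.

```python
import re
from typing import Dict

# Counter labels exactly as they appear in the report pasted in the log.
BLOCK_COUNTERS = (
    "Under replicated blocks",
    "Blocks with corrupt replicas",
    "Missing blocks",
)

# Matches "Label: 123" lines; lines with trailing text such as "(1.72 PB)" do not match.
COUNTER_LINE = re.compile(r"^\s*([^:]+):\s*(\d+)\s*$")


def parse_block_counters(report: str) -> Dict[str, int]:
    """Extract the block health counters from `hdfs dfsadmin -report` style text."""
    counters: Dict[str, int] = {}
    for line in report.splitlines():
        match = COUNTER_LINE.match(line)
        if match is None:
            continue
        label = match.group(1).strip()
        if label in BLOCK_COUNTERS:
            # Keep the first value seen for each label.
            counters.setdefault(label, int(match.group(2)))
    return counters


# Sample text copied from the 12:30 messages in this log (timestamps removed).
SAMPLE_REPORT = """\
Configured Capacity: 1937161485090816 (1.72 PB)
Present Capacity: 1915101328101666 (1.70 PB)
DFS Remaining: 538274591326004 (489.56 TB)
DFS Used: 1376826736775662 (1.22 PB)
DFS Used%: 71.89%
Under replicated blocks: 0
Blocks with corrupt replicas: 22
Missing blocks: 0
Missing blocks (with replication factor 1): 0
"""
```

With the sample above, `parse_block_counters(SAMPLE_REPORT)["Blocks with corrupt replicas"]` gives the 22 that puzzled everyone, alongside zero missing and under-replicated blocks.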
[14:57:27] joal, elukey, reading about RU
[14:58:26] joal, elukey, yes, RU should rerun the missing data
[14:58:35] great :)
[14:58:57] thinking of that, I think I'll ask bearND|afk if he can update his jobs to run in nice queue :)
[14:59:54] Analytics-Kanban, User-Elukey: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T168303#3562196 (elukey) Back to acceptable levels: ``` /dev/mapper/tank-data 6.4T 5.5T 946G 86% /srv ```
[15:02:12] Analytics-Kanban, Analytics-Wikistats: Use daily granularity for 1-month time ranges - https://phabricator.wikimedia.org/T173372#3562201 (fdans)
[15:02:31] fdans: yoohoo
[15:16:53] joal: +1 for the nice queue
[15:28:38] Analytics-Kanban, User-Elukey: Archive PageContentSaveComplete in hdfs while we continue collecting data - https://phabricator.wikimedia.org/T170720#3440380 (Ottomata) Before we close this, we should move the tables that Nuria created in Hive to a new 'archive' database.
[15:30:17] Analytics-Kanban, DBA, Patch-For-Review: Inconsistent default charset for analytics slaves - https://phabricator.wikimedia.org/T170952#3448710 (Ottomata) Hm, for now I think this is good, so that the slave matches the master. If we want everything to be a different charset, we should change it as a...
[15:37:36] elukey: quick cumin q: can I use it to tail logs?
[15:37:39] on multiple hosts?
[15:43:23] joal: which jobs are you referring to? I'm missing context here.
[15:43:37] ottomata: yarn logs --applicationId application_1504006918778_0236 | less
[15:44:28] bearND: The jobs that are currently under your username on the cluster: they use a lot of resource in the user queue, and since they are automated (or so I guess), having them run in nice queue could be, well, nice :)
[15:45:04] ottomata: I don't think so
[15:45:08] joal: on what machine? I don't know anything about them.
[15:45:21] on scb cluster?
[15:45:29] bearND: hadoop cluster
[15:45:49] joal: like a hive query? I've never run a hive query
[15:46:09] bearND: I think I have made a very wrong mistake of names - I apologize
[15:46:21] bearND - I was after bearloga ...
[15:46:26] * joal hides in shame
[15:46:32] joal: oh, i see.
[15:47:08] joal: Yeah, that is a different person. He started after me but it's easy to type the wrong one in chat.
[15:47:20] Please excuse me again
[15:47:24] bearND: --^
[15:47:46] bearloga: those comments on reportupdater jobs in nice queue were actually for you :)
[15:47:47] joal: no worries. I'm just glad you get to ping the correct person :)
[15:47:57] So am I bearND !
[15:50:45] joal: fwiw we run our reportupdater scripts with nice and ionice
[15:51:07] bearloga: It's about using the nice queue in hadoop :)
[15:53:18] joal: ah. i don't know anything about that but if there's something we can do to make our daily runs of scripts & queries even friendlier, i'm open to recommendations :)
[15:55:23] bearloga: o/
[15:55:41] elukey: hi! o/
[15:55:58] bearloga: We added a new queue in the cluster for this purpose - Let me see how we do that in hive
[15:56:01] ottomata: whenever you want we can chat about certpy, or do it tomorrow during your morning
[15:57:56] bearloga: In hive script, before executing the query: set mapred.job.queue.name=nice;
[15:58:31] This queue has less priority than the user one, so users get results faster - but given the cluster activity globally, jobs will finish fast enough I think :)
[15:58:37] bearloga: --^
[16:02:22] joal: so if foo.hql is the query being executed (for example), does `set mapred.job.queue.name=nice` go inside foo.hql, or is it mapred.job.queue.name set with `hive -hiveconf` or it doesn't matter?
[16:04:56] bearloga: I think it doesn't matter - I tried both successfully (actually, 2 dashes for hiveconf: --hiveconf mapred.job.queue.name=nice)
[16:06:07] joal: got it, thank you!
[16:06:14] thank you bearloga :)
[16:08:36] joal: is the new nice queue documented anywhere on wikitech or in a phab ticket? (just so I have something to link to in the patch notes)
[16:09:52] elukey: i got 20 mins til next meeting, want a certpy overview?
[16:10:00] ottomata: sure
[16:10:06] bc
[16:10:07] bearloga: we have a phab ticket: T156841
[16:10:07] T156841: Hadoop: Add a lower priority queue: nice queue - https://phabricator.wikimedia.org/T156841
[16:10:18] bearloga: We should document it as well, but not yet done
[16:14:34] joal: thanks!
[16:18:52] bearloga: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Run_long_queries_in_a_screen_session_and_in_the_nice_queue
[16:18:55] as well :)
[16:25:08] joal: *thumbs up*
[16:36:11] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Cannot request more than 4 cores per spark executor - https://phabricator.wikimedia.org/T172018#3562540 (EBernhardson) `hdfs getconf` now reports 32, and spinning up a spark repl with 8 cores per executor is able to get executors and run code. Lo...
[16:41:43] (CR) Mforns: "Hi Nettrom, sorry for the delay." (3 comments) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/373373 (owner: Nettrom)
[16:52:46] joal: could you please take a look at https://gerrit.wikimedia.org/r/#/c/374569/ and +1 if I did it correctly? just want to make 100% sure before chelsyx and I merge it
[16:56:51] Analytics-Kanban, RESTBase-API, WMF-Legal, Patch-For-Review, Services (done): License for pageview data - https://phabricator.wikimedia.org/T170602#3562642 (mforns) Hi @Pchelolo :] I saw that https://wikimedia.org/api/rest_v1/#/ now has the updated licenses. Can I move this to done in our boa...
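Editor's note: the two ways of selecting the nice queue mentioned between 15:57 and 16:04 (a `set mapred.job.queue.name=nice;` line before the query, or `--hiveconf mapred.job.queue.name=nice` on the command line) can be sketched as below. The property name, the `--hiveconf` flag, and the `foo.hql` example come from the chat; the helper function names are hypothetical and only build the strings, they do not run Hive.

```python
from typing import List

# Property and queue name as given in the chat.
QUEUE_PROPERTY = "mapred.job.queue.name"


def hive_command(query_file: str, queue: str = "nice") -> List[str]:
    """Build a `hive` invocation that sets the queue via --hiveconf (two dashes)."""
    return ["hive", "--hiveconf", f"{QUEUE_PROPERTY}={queue}", "-f", query_file]


def with_queue_header(query_text: str, queue: str = "nice") -> str:
    """Prepend the `set` statement so the queue is chosen inside the script itself."""
    return f"set {QUEUE_PROPERTY}={queue};\n{query_text}"
```

For example, `hive_command("foo.hql")` yields an argument list suitable for `subprocess.run`, while `with_queue_header(...)` keeps the setting next to the query so it travels with the file. According to the chat, either route worked when tried.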
[16:57:27] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562647 (RobH)
[16:57:46] Analytics-Kanban, RESTBase-API, WMF-Legal, Patch-For-Review, Services (done): License for pageview data - https://phabricator.wikimedia.org/T170602#3562693 (Pchelolo) Open>Resolved All done here indeed, resolving the ticket.
[17:25:41] Analytics-Kanban, Discovery, Discovery-Analysis: Private data access for non-person user that calculates metrics - https://phabricator.wikimedia.org/T174110#3562852 (Ottomata) Ok, met with Chase and Luca, and we decided that Option 2 is the way to go. I'll make a subtask...
[17:30:37] Analytics, Operations: Puppet admin module should support adding system users to managed groups - https://phabricator.wikimedia.org/T174465#3562875 (Ottomata)
[17:30:47] chasemp: https://phabricator.wikimedia.org/T174465
[17:30:57] does my description match what we just talked about?
[17:31:03] * elukey off !
[17:31:08] laters!
[17:34:14] Hey, I'm running query against ApiAction data for wikidata but it seems it only records core actions and not wikibase actions
[17:34:25] I'm talking about this: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/ApiAction
[17:34:58] When I run it on action = 'query' it works but when I run it on action = 'wbgetentities' it doesn't
[17:35:06] (gives me null results)
[17:35:35] I probably need to file a phab card but was checking if it's something obvious that I'm missing
[17:40:51] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562940 (RobH)
[17:45:56] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3562979 (RobH) So I realized that: d-i partman-auto/choose_recipe es was in the recipe, and isn't needed since it doe...
[18:21:07] Analytics: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563202 (Ladsgroup)
[18:21:18] Filed: https://phabricator.wikimedia.org/T174474
[18:21:39] Analytics, MediaWiki-extensions-WikibaseRepository, Wikidata: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563216 (Ladsgroup)
[18:23:18] Analytics, MediaWiki-extensions-WikibaseRepository, Wikidata: ApiAction log in data lake doesn't record Wikibase API actions - https://phabricator.wikimedia.org/T174474#3563221 (Ladsgroup)
[18:33:05] ottomata: Based on https://groups.google.com/forum/#!topic/druid-development/wYHQznyonW0, can we try to add the "druid.indexer.task.chathandler.type=announce" to overlord and middlemanager conf?
[18:37:08] looking
[18:39:25] joal: i'm not so sure that is a current config propertly
[18:39:28] property
[18:39:34] hm
[18:39:39] looking though
[18:40:05] I can't think of something else (in /etc/druid/[overlord|middlemanager]/runtime.properties)
[18:40:51] i can't find docs for it
[18:41:13] Arf :(
[18:42:50] but joal in that thread you linked to, the error was something about time stamp not being parsed
[18:43:00] were you able to get rid of that warning we saw?
[18:43:02] ottomata: from https://groups.google.com/forum/#!topic/druid-user/rCmhhJ67iw4, seems to be in runtime.properties, but no real doc indeed
[18:43:04] about partitioning string?
[18:43:16] yeah, joal it must be an old deprecated property maybe?
[18:43:17] ottomata: trying again to triple check
[18:43:19] not sure though
[18:43:31] also ottomata, I double checked task creation, looks good
[18:46:45] * imandes waves to everyone
[18:48:05] ottomata: still the same warning - investigating in that direction - thanks :)
[18:48:15] Hi @ottomata, seems that I don't get any events from Eventstreams. Everything worked fine a few days ago (I am writing a bot that feeds on it). I checked on my computer and on: https://codepen.io/ottomata/pen/VKNyEw/?editors=0010.
[18:48:44] imandes: hi!
[18:48:47] interesting!
[18:48:48] looking
[19:08:56] Analytics: Add redirect and pagelinks tables for partition repair in sqoop job for mediawiki history - https://phabricator.wikimedia.org/T174484#3563442 (mforns)
[19:17:51] Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563493 (Mattflaschen-WMF) Thanks for running this check. Flow looks good. ``` File: /home/reedy/git/mediawiki/core/extens...
[19:18:17] Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563503 (Mattflaschen-WMF)
[19:22:19] Analytics-EventLogging, Analytics-Kanban, AbuseFilter, CirrusSearch, and 29 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3563526 (Reedy) `func_get_args` is very false positive/naive check. Upstream issue filed, with various examples provided at...
[19:30:40] ottomata: you were RIGHT ! As usual :)
[19:31:14] ottomata: Tranquility doc on how to represent data for it to work is really not good :)
[19:33:13] joal?!
[19:33:15] you got it working?
[19:33:15] !
[19:33:23] I DID IT :)
[19:33:32] !!!!!
[19:33:35] WHAAA AAMZING
[19:33:57] Not that difficult once you know the hidden glitches
[19:34:39] Now I'm interested to see if my spark job goes well through the night and more :)
[19:35:02] ottomata: Do you have a minute for a scala question?
[19:35:42] joal: in a few i will
[19:35:48] petr and i broke eventstreams yesterday
[19:35:51] well, we broke mirror maker
[19:36:00] meaning no events were making it to analytics cluster
[19:36:01] :o
[19:36:04] fixing now
[19:36:09] !log restarting all kafka brokers and mirror maker processes to apply https://gerrit.wikimedia.org/r/#/c/374610/
[19:36:21] wow - 7 years for a mirror ... What about a mirror-maker?!!
[19:36:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:36:48] haha
[19:38:24] elukey: ahh i need to restart all kafka brokers
[19:38:29] did you want me to do jvm upgrade?
[19:38:31] how do I do it?
[19:38:33] moritzm: ?
[19:39:55] java is already upgraded on the kafka brokers
[19:40:24] I'm not sure which servers were already restarted by Luca, didn't follow that closely today, but it's likely in SAL
[19:40:48] ok great
[19:40:50] thanks
[19:58:14] ottomata: My spark job tells me you've restarted some brokers :)
[19:58:54] ohyeah
[20:00:18] ottomata: Thanks for having pushed me to investigate in format more :)
[20:00:29] yaaaa! so glad that works
[20:00:33] sucks that that wasn't anywhere in logs
[20:00:43] well, more about the error
[20:02:51] (PS1) Mforns: Add pagelinks and redirect to refinery-drop-mediawiki-snapshots [analytics/refinery] - https://gerrit.wikimedia.org/r/374623 (https://phabricator.wikimedia.org/T174484)
[20:24:16] Analytics, Wikimedia-Stream: Alerts for common/import EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3563830 (Ottomata)
[20:28:59] Analytics, Analytics-Cluster, Operations, ops-eqiad, User-Elukey: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3563854 (RobH) Ok, so putting the recipe info to ignore noswap requires: partman-basicfilesystems partman-basicfilesystems/no_...
[20:29:44] Amir1: (re https://phabricator.wikimedia.org/T174474 ) i guess bd808 may have some knowledge about ApiAction
[20:30:08] HaeB: thanks
[20:32:21] (CR) Ottomata: [C: 1] Add pagelinks and redirect to refinery-drop-mediawiki-snapshots [analytics/refinery] - https://gerrit.wikimedia.org/r/374623 (https://phabricator.wikimedia.org/T174484) (owner: Mforns)
[20:32:41] Analytics, Wikimedia-Stream: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3563894 (Ottomata)
[20:40:53] ottomata: eventstreams seems to be back to normal, thanks! :)
[20:44:03] (PS7) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[20:44:59] (Abandoned) Joal: Improve resiliency of Banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/359461 (https://phabricator.wikimedia.org/T169101) (owner: Joal)
[20:52:28] Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3563960 (Krinkle) @Ottomata Thanks a lot for doing that. Looking at the Navigation Timing metrics in Graphite through Gr...
[20:52:49] Analytics, Analytics-EventLogging, Performance-Team, Patch-For-Review: Make webperf eventlogging consumers use eventlogging on Kafka - https://phabricator.wikimedia.org/T110903#3563963 (Krinkle)
[21:25:02] (PS8) Joal: Add tranquility to the banner streaming job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/373030 (https://phabricator.wikimedia.org/T168550)
[21:25:05] ottomata: if you have a minute, do you mind having a look --^ ?
[21:57:16] (PS2) Nettrom: Add page creation configuration and queries [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/373373
[22:30:28] Analytics, Wikimedia-Stream, Wikimedia-Incident: Alerts for common/important EventStreams topic volume - https://phabricator.wikimedia.org/T174493#3564376 (greg)
[22:55:21] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564431 (fdans)
[23:01:13] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564463 (fdans)
[23:05:58] Analytics, DBA: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3564480 (demon) Been about a month and a half. Bump?
[23:08:09] Analytics-Kanban, Analytics-Wikistats: Productionise line graph - https://phabricator.wikimedia.org/T171766#3564488 (fdans)
[23:08:11] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (4/4) - Detail page - https://phabricator.wikimedia.org/T170940#3564489 (fdans)
[23:08:13] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (2/4) - Wiki selector - https://phabricator.wikimedia.org/T170936#3564490 (fdans)
[23:08:15] Analytics-Kanban, Analytics-Wikistats: Wikistats2 bugs (1/4) - Dashboard and general UI - https://phabricator.wikimedia.org/T170933#3564491 (fdans)
[23:08:23] Analytics-Kanban, Reading-analysis: Final Vetting of Family Wide unique devices data - https://phabricator.wikimedia.org/T169550#3401448 (ksmith) @Tbayer : Do you have a status update on this? Thanks!
[23:20:18] RECOVERY - HDFS corrupt blocks on analytics1001 is OK: OK: Less than 60.00% above the threshold [2.0]