[00:09:38] 10Analytics, 10Analytics-EventLogging, 10Cognate, 10Collaboration-Team-Triage, and 14 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3957089 (10Krinkle)
[00:10:03] 10Analytics, 10Analytics-EventLogging, 10Cognate, 10Collaboration-Team-Triage, and 14 others: Possible WMF deployed extension PHP 7 issues - https://phabricator.wikimedia.org/T173850#3541977 (10Krinkle)
[00:11:09] 10Analytics, 10Analytics-EventLogging, 10Cognate, 10Collaboration-Team-Triage, and 14 others: Possible PHP7 compatibility issues in WMF-deployed extensions - https://phabricator.wikimedia.org/T173850#3541977 (10Krinkle)
[03:07:59] 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, and 2 others: Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3957368 (10mobrovac)
[03:09:44] 10Analytics, 10ChangeProp, 10EventBus, 10Reading-Infrastructure-Team-Backlog, 10Services (doing): Update node-rdkafka version to v2.x - https://phabricator.wikimedia.org/T176126#3614140 (10mobrovac) The SCB cluster has been upgraded today and the relevant services deployed (a big shout out to @Pchelolo a...
[03:15:28] (03CR) 10Milimetric: [C: 04-1] "-1 is only for the set difference problem, the rest is fine and I'll rebase on top of this and move --generate-jar out anyway" (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408930 (https://phabricator.wikimedia.org/T186541) (owner: 10Joal)
[08:10:14] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3957602 (10elukey) It seems that recently https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=888673 was opened to maintain golang-github-karrick-gos...
[08:11:26] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3957604 (10JAllemandou) @bmansurov: The last snapshot I realized was beginning of 2017-06 (named 2017-05, since the last full month is May 2017). It's available in two formats: - hdfs:///user/joal/wmf/data/raw/mediawiki/xmldumps/...
[08:38:04] (03CR) 10Joal: Update sqoop-mediawiki-tables script (034 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408930 (https://phabricator.wikimedia.org/T186541) (owner: 10Joal)
[08:38:26] (03PS3) 10Joal: Update sqoop-mediawiki-tables script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408930 (https://phabricator.wikimedia.org/T186541)
[08:38:36] elukey: Good morning
[08:38:43] elukey: any opinion on --^ ?
[08:44:29] joal: hello!
[08:46:02] so I'd prefer not to have to maintain code that sends emails if possible; would it be hard to just split between stdout/stderr? That way we could benefit from it when used via CLI or via cron
[08:46:43] elukey: problem is that I don't know how sqoop uses stdout/err
[08:49:09] elukey: I'll try to have it done the correct way
[08:49:43] joal: not sure if I am understanding, but the python script already has all the logic to report an error via email, no? It would be best imo to just emit it to stderr
[08:49:53] so via CLI it will be useful too
[08:50:20] but probably there are things that I am not aware of :(
[08:50:20] elukey: CLI emits logging - so you'll see everything there
[08:51:17] elukey: the cron is set up so that it logs into a file
[08:51:29] COMMAND >> file 2>&1
[08:52:09] If I understand correctly, this sends COMMAND stdout to file, and then COMMAND stderr to stdout - right?
[08:52:12] ?
[08:53:21] so we need to decide first how we want to structure our crons
[08:54:09] the current idea is to have stderr emitted so it will alert us via email, but in this way we don't get it logged to the file (if not specifically handled by a logger etc..)
[08:54:34] elukey: This seems like a recurring question any time we face this kind of thing :)
[08:54:40] yeah
[08:55:15] elukey: We could tee stderr?
[08:55:19] not super nice
[08:56:03] elukey: What I think we'd like is: stdout + stderr --> file AND stderr --> MAILTO
[08:56:04] but in this case, it seems to me that the script already logs the error msg to the file, and then uses it again in an email
[08:56:14] correct elukey
[08:56:28] so probably the extra code to send the email is not necessary, no?
[08:56:37] elukey: Actually, the script logs the message to stderr, and sends an email
[08:57:11] elukey: the ERROR ends up in the file because of the redirects above
[08:57:23] I think
[08:57:29] :q
[08:57:32] oops
[08:58:33] the main issue that I can see with this approach is having the email-sender code copied into multiple scripts here and there in the long term
[08:58:51] I hear that elukey
[08:59:09] I am not opposed to your idea joal, only trying to figure out what's best for maintainability
[08:59:56] elukey: I don't mind one way or another actually - it's just that so far we've not put effort into normalizing our crons
[09:01:24] we kinda have the guideline of splitting stdout/stderr now iirc, but we can discuss it.. the best would be to have a logger to a file in which we store whatever we want, and then use stderr to report an error when needed
[09:01:32] so we have the best of both worlds
[09:01:56] if --log-to-file-something /var/something.log is not defined, all goes to stdout
[09:47:53] elukey: I'm fighting with external calls ... Would you have a minute for batcave?
[09:58:10] joal: in 10 min?
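The "best of both worlds" setup elukey describes above — everything into a log file (or stdout when no file is given), with ERRORs additionally on stderr so cron's MAILTO picks them up — can be wired together with stock `logging` handlers. A minimal sketch; the flag name and log format are made up, not the actual refinery code:

```python
import logging
import sys


def setup_logging(log_file=None):
    """Route all records to log_file (or stdout), and ERRORs also to stderr."""
    root = logging.getLogger()
    root.setLevel(logging.INFO)

    # Main stream: a file when something like --log-to-file is passed,
    # plain stdout otherwise (so CLI usage still shows everything).
    if log_file:
        main = logging.FileHandler(log_file)
    else:
        main = logging.StreamHandler(sys.stdout)
    main.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root.addHandler(main)

    # ERRORs additionally go to stderr, which cron mails via MAILTO.
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.ERROR)
    root.addHandler(err)


setup_logging()
logging.info("sqooping table")   # main stream only
logging.error("sqoop failed")    # main stream AND stderr
```

With this, the script itself needs no email-sender code: the cron entry just redirects stdout to the log file and leaves stderr for the mailer.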
In the middle of something :(
[09:58:19] sure
[10:23:15] joal: all right, finished fighting with a debian build, ready :)
[10:23:37] elukey: I actually managed to find a way, I think I'm good
[10:23:44] ohhhh
[10:23:51] sorry for the delay :(
[10:25:31] no worries
[10:25:58] elukey: Here's what I have: subprocess.check_call (we use that) correctly redirects stdout and stderr to the python parent process
[10:26:21] So logging to the correct std stream should do -
[10:43:10] 10Analytics: Upload XML dumps to hdfs - https://phabricator.wikimedia.org/T186559#3957904 (10diego) @bmansurov could you try to upload the 20180201 dumps for en,ru,ar,jp,fr,es in parquet? This is not urgent but might be useful for the section recommendations project.
[10:48:59] 10Analytics-EventLogging, 10Analytics-Kanban: Sunset MySQL data store for eventlogging. Find an alternative query interface for eventlogging on analytics cluster that can replace MariaDB - https://phabricator.wikimedia.org/T159170#3957923 (10jcrespo) Have you considered clickhouse? It seems like an interesting o...
[10:55:04] elukey: one minute now ?
[10:56:36] sure
[10:56:43] over irc or batcave ?
[10:56:50] as you wish
[10:58:06] IRC would work better but I can join if you want
[10:58:11] k
[10:58:40] elukey: I can separate streams correctly for logging between stderr and stdout
[10:59:40] elukey: I however won't set up file logging in the script: I can't easily get the output of subprocesses logged to that stream
[11:00:09] elukey: So what we'll end up with: stdout has everything from a python logging perspective as well as stdout from subprocesses
[11:00:38] elukey: stderr has only ERROR from python logging and stderr from subprocesses
[11:01:24] elukey: I can also try to get output of subprocesses as strings inside the python process, and then log them, but it's again a bigger change
[11:01:24] joal: maybe stderr from the subprocess can go to the stdout of the wrapping script, so we'll control what will go to stderr
[11:01:39] elukey: I can't do that easily
[11:02:18] elukey: as of now, subprocesses manage their output, and they are correctly redirected by python
[11:02:29] If I want to redirect them, I need to cheat
[11:04:56] elukey: I found the cheatery :)
[11:07:58] :)
[11:32:45] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3957976 (10elukey) Tested the package on krypton to have a good set of "real" metrics to check: ``` elukey@krypton:~$ curl localhost:8080/metrics...
[11:39:34] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3957980 (10elukey) Main pain points: * on krypton we are using Burrow version 0.0.1, meanwhile upstream is already to 1.0. Packaging 1.0 requires...
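Joal's point about subprocess.check_call can be seen in a tiny sketch (the child commands here are made up stand-ins for sqoop): the child inherits the parent's stdout/stderr, so any redirection cron applies to the wrapper also applies to the subprocess, and a non-zero exit surfaces as CalledProcessError for the wrapper to log:

```python
import subprocess
import sys

# By default check_call gives the child the parent's stdout/stderr, so
# a cron redirection on the wrapper ("wrapper >> file 2>&1") catches the
# subprocess output too, on the right streams.
child = "import sys; print('row'); print('warn', file=sys.stderr)"
subprocess.check_call([sys.executable, "-c", child])

# A non-zero exit status raises CalledProcessError, which the wrapper
# can turn into a python logging ERROR (and thus route to stderr).
try:
    subprocess.check_call([sys.executable, "-c", "raise SystemExit(3)"])
except subprocess.CalledProcessError as e:
    print("child failed with code", e.returncode)
```

Redirecting only the child's stderr into the parent's stdout would require capturing the streams explicitly (e.g. with PIPE) — the "cheat" joal mentions — rather than letting them pass through.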
[11:42:50] Burrow metrics incoming (hopefully) --^
[11:42:51] * elukey brb
[11:55:48] (03PS4) 10Joal: Update sqoop-mediawiki-tables script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408930 (https://phabricator.wikimedia.org/T186541)
[11:55:52] elukey: Hopefully better --^
[12:05:08] yep it looks good :)
[12:05:29] elukey: triple checking now, but should be ok
[12:11:28] the testing vk instance to jumbo has been pushing cache misc traffic really smoothly during the past hours
[12:11:31] https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&var-instance=webrequest_jumbo_duplicate&from=now-12h&to=now
[12:12:12] I am curious to check latency data, not sure if we have it with prometheus kafka metrics though
[12:13:35] ah no, as always I am stupid
[12:14:05] the metrics that I care about (rtt to brokers etc..) are librdkafka metrics of the client of course, not yet in prometheus
[12:19:35] !log Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8
[12:19:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:34:21] Naé needs to get back home from the crèche - will be there later this afternoon
[12:36:59] o/
[12:37:15] for everybody, not sure if this is correct but better than nothing for the moment
[12:37:31] I wanted to check latencies between cp hosts and kafka brokers, for analytics and jumbo
[12:37:38] so I picked up the rtt librdkafka metric
[12:37:51] it is a graphite metric reporting min/max/avg
[12:38:16] so I plotted three graphs for plaintext (port 9092) and three for TLS (port 9093)
[12:38:37] avg(min), avg(avg), avg(max)
[12:38:43] https://grafana.wikimedia.org/dashboard/db/varnishkafka
[12:39:07] it is not super perfect but it is good to see the diff between the webrequest instance and the webrequest-duplicate-jumbo one
[12:40:37] as far as I can see all looks good
[12:41:01] I thought to use percentiles but it was a bit confusing imho, I wanted one (aggregated) metric for each broker
[12:41:42] anyhow,
so far the test looks very good :)
[12:51:28] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3958096 (10elukey) From IRC: ``` 13:36 I wanted to check latencies between cp hosts and kafka brokers, for analytics and jumbo...
[12:55:23] * elukey lunch!
[14:49:46] mforns[m]: what's the status of the eventcapsule change?
[14:49:49] can we merge soon?
[14:52:07] ottomata: o/
[14:52:16] I am doing some logging experiments on el beta
[14:52:21] let me know if I need to revert
[14:52:50] hiii
[14:52:53] np please do :)
[14:53:08] do you have a minute for a quick irc brainstorm?
[15:00:12] elukey: sure coming
[15:00:21] bc
[15:01:02] ottomata: ah ok irc was ok too :D
[15:01:26] 10Analytics-EventLogging, 10Analytics-Kanban, 10Patch-For-Review: Remove EL capsule from meta and add it to codebase - https://phabricator.wikimedia.org/T179836#3958461 (10Ottomata) [x] Mark https://meta.wikimedia.org/wiki/Schema:EventCapsule as deprecated [x] Populate https://wikitech.wikimedia.org/wiki/Ana...
[15:10:13] 2018-02-09 15:10:04,582 [2834] (MainThread) Enqueued an event via confluent-async
[15:10:23] ottomata: it logs it if I run it manually
[15:10:35] so it might be something weird with the daemon
[15:10:37] hey team :]
[15:11:35] hello Marcel :)
[15:11:45] hello!
[15:13:19] ottomata: so I did an eventloggingctl stop and then a start, I can see constant logs in the processors now.. let's seee
[15:14:43] oh hm, maybe they just didn't restart properly?
[15:14:48] elukey: i don't trust eventloggingctl restart
[15:14:53] i always do stop && sleep 5 && start
[15:15:04] hey marcel!
[15:15:24] mforns: i'm interested in merging your eventcapsule change
[15:15:37] i need to add a field (ip) and would rather do it in your code
[15:15:56] i added it to the schema on meta already, but also updated docs there saying that the meta schema is just for reference, no longer canonical
[15:15:57] ottomata, sure go ahead
[15:16:00] and linked to other places
[15:16:01] great!
[15:16:20] ottomata, do you want to pair on that?
[15:16:21] * elukey wants systemd units for eventlogging and to ditch upstart/ubuntu once and for all
[15:16:51] mforns: , the IP thing?
[15:16:54] ottomata, I tested it in beta and all tests pass
[15:16:55] yes
[15:17:14] mforns: the el side is actually very easy
[15:17:15] https://phabricator.wikimedia.org/T186833
[15:17:22] oh already, ok
[15:17:36] ok ok
[15:18:53] ottomata, later I will add some docs to Wikitech on the new EventCapsule location and format
[15:19:24] mforns: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/EventCapsule
[15:19:25] :)
[15:20:19] ottomata, awesome, thank you!
[15:20:38] thank you! this makes the IP change so much simpler
[15:21:18] mforns: https://gerrit.wikimedia.org/r/#/c/409350/
[15:21:32] ottomata: I’ve got a wingnut idea that you’ve probably thought about yourself…. If you have a minute?
[15:21:36] :]
[15:21:38] sure
[15:21:39] awight:
[15:21:42] wasssup?
[15:21:45] hehe
[15:22:23] So, the general issue we’re facing with JADE is that we have a multi-master thing happening where most of our entities live in multiple data stores.
[15:22:50] I realized that this is a common problem, and sort of what EventBus is designed to help with.
[15:23:51] Yesterday, I was daydreaming about a PHP library on the MediaWiki side, and a Python library on the other side, which would hide most of the mechanics of how to synchronize shared data objects across MediaWiki and external data stores.
[15:24:03] Has there been work along these lines, that you know of?
[15:26:19] awight: not so specifically, but there has been thought about things like that
[15:26:49] change prop for html updates previously, and more recently a new program for mw dependency tracking is in the works for a next fy program
[15:27:27] kinzler and services I think envision a graph database of dependencies that can be used to ensure that state changes propagate to all the places they need
[15:27:34] whew!
[15:27:44] with kafka used as the event mechanism to update the graph db
[15:27:49] this is a long way out though
[15:27:54] next FY will be mostly brainstorming about how to do this
[15:28:09] Thanks for the prior art review :)
[15:28:09] i've only just heard about the graph db dep tracking stuff though, so I don't know much about it
[15:28:13] but it is early stages ya
[15:28:13] :)
[15:28:31] The HTML updates are more of a replica than multi-master, but definitely related.
[15:28:57] those are yeah, but dep tracking i think is a larger view than just html updates
[15:29:16] Dependency tracking makes sense to me; luckily our use case is much simpler. Something in between those two initiatives.
[15:29:21] yeah
[15:29:31] but, if done right, and if you get input into the brainstorming process
[15:29:38] the system might work for you and other similar use cases
[15:29:46] ottomata, reviewed your patch, makes sense, but shouldn't we re-implement the code that parses the IP and adds it to the capsule?
[15:30:00] you probably won't be the only one with a shared-state (even if writable from multiple locations, e.g. multi-master) problem
[15:30:05] yup!
[15:30:08] well mforns no
[15:30:12] we don't need to re-implement any code
[15:30:15] that code did hashing
[15:30:19] that's the only reason there was special code
[15:30:28] we only need to pull the field in
[15:30:43] and that can be done with the %i formatter
[15:30:59] Yah I’m hoping to find the other people with this problem.
I think you saw this diagram, the cycle I’m looking at is in the bottom-center, https://docs.google.com/drawings/d/1Lagl0BJWVWHNvHLy5y6RNNKvl0C1tdVrE5YniwgqFJY/edit
[15:31:11] %{ip}i
[15:31:13] ottomata, mmmm, but, does the ip have a specific formatter still? didn't we remove it?
[15:31:18] oh ok
[15:31:29] and this goes in puppet?
[15:31:44] ya
[15:31:48] k k
[15:31:49] it's just a string
[15:31:56] the only reason it had a specific formatter was to hash it
[15:32:03] I see, cool
[15:32:36] I see now that the change_prop->JADE user service is a degenerate version of the same problem, where we just host a replica. We’re still using “command query responsibility segregation” there. I’ll try to write something up about what we need.
[15:33:19] cool, ya, i won't be thinking too hard about that project, but the event data platform (eventbus 2.0?) program will support it and be informed by it
[15:33:24] ottomata, +1'd, didn't +2 in case you still want to read my comment, but feel free to merge on my side!
[15:33:30] so make sure marko and petr and maybe dkinzler know about it
[15:33:42] mforns: also: https://gerrit.wikimedia.org/r/409354
[15:33:42] :)
[15:34:07] mforns: i picked ip because that is what webrequest users
[15:34:08] uses
[15:34:13] ty
[15:34:18] aha
[15:35:18] wow mforns this is already better, we can review eventcapsule changes in gerrit
[15:35:18] :)
[15:35:32] yea :]
[15:39:18] ottomata, oh wait, the tests might fail!
[15:39:41] ah no, the capsule is not hardcoded in the test, never mind
[15:40:15] :D
[15:40:17] so great, right!
[15:40:26] don't have to modify the fixture
[15:41:14] i will add fake ip data to test fixture records tho
[15:42:51] mforns: in the vk patch
[15:42:55] o is response header
[15:42:58] i is request header
[15:43:07] X-Client-IP is set by varnish as a response header
[15:44:18] response?
[15:45:03] wasn't it a request header?
[15:45:19] X-Client-IP is parsed by varnish from X-Forwarded-For, and then set as a response header
[15:45:42] this is also how it was before: https://gerrit.wikimedia.org/r/#/c/275892/2/modules/role/manifests/cache/kafka/eventlogging.pp
[15:46:00] (although, I think the @ip in that old code is a mistake; vk ignores it in non-json output format)
[15:46:17] wow it would be really nice to change the vk formatting here though, and change the el processor
[15:46:21] to make vk-eventlogging emit json
[15:46:42] {"ip": ..., "user_agent":..., uri_query: ... }
[15:46:51] then eventlogging processor could just parse uri_query as the `event` field
[15:48:28] elukey: you still got beta el patched?
[15:48:41] yep
[15:48:48] but I can reset
[15:49:48] ottomata, understood thx
[15:50:31] btw, ottomata, when you're done with this and have a min, I'd like to discuss a functional detail about Hive Purging
[15:51:29] sure!
[15:51:37] mforns: i have a min, i'm not going to deploy anything today
[15:51:39] just getting beta set up
[15:51:41] and patches ready
[15:51:44] hopefully to deploy on monday
[15:51:51] elukey: no hurry at all
[15:51:53] ok, so
[15:51:58] i can test in the afternoon
[15:52:12] the way we decided to go for it...
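Ottomata's idea above — varnishkafka emitting a JSON message so the eventlogging processor only has to decode `uri_query` into the `event` field — could look roughly like this. The exact message shape, the percent-encoded payload, and the trailing `;` are illustrative assumptions, not the real vk-eventlogging format:

```python
import json
from urllib.parse import unquote

# Hypothetical JSON line as vk-eventlogging might emit it: plain fields
# from varnish headers, plus the raw EventLogging query string.
raw = ('{"ip": "10.0.0.1", "user_agent": "TestUA", '
       '"uri_query": "?%7B%22action%22%3A%20%22view%22%7D;"}')

msg = json.loads(raw)
# The event payload is assumed to be a percent-encoded JSON blob in the
# query string; strip the leading "?" and trailing ";" and decode it.
event = json.loads(unquote(msg["uri_query"].lstrip("?").rstrip(";")))
capsule = {"ip": msg["ip"], "userAgent": msg["user_agent"], "event": event}
```

Compared with parsing a positional non-JSON format string, fields like `ip` then arrive pre-labeled and adding one more field is just another JSON key.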
[15:52:52] ottomata: feel free to reset whenever you want, that error doesn't happen so often, so I could even need to wait for a day
[15:52:53] imagine we have the purging set up, and we have 2 tables for each schema, one 90-days-only with all data, and the other with sanitized data since the beginning of time
[15:53:18] and we discover that one of the tables has a field that we are not currently purging, but that we should purge
[15:53:32] so, we remove it from the whitelist
[15:53:56] from now on the data will be 'refined-sanitized' into the sanitized table
[15:54:11] but historical data still has the sensitive field populated
[15:54:38] we can re-run until 90 days ago, and overwrite sanitized data
[15:55:09] but we can not (easily) sanitize older data, because we have no source data...
[15:55:45] does this make sense?
[15:56:39] no source data...
[15:56:41] ?
[15:57:07] mforns: not really, isn't most purging just setting fields to null?
[15:58:29] ottomata, yes, but it creates a new sanitized table given a source table
[15:59:05] wait... batcave?
[15:59:14] ya
[15:59:16] k
[16:08:04] 10Analytics-Kanban, 10ChangeProp, 10EventBus, 10Patch-For-Review, and 2 others: Export burrow metrics to prometheus - https://phabricator.wikimedia.org/T180442#3958798 (10fgiunchedi) >>! In T180442#3957980, @elukey wrote: > Main pain points: > > * on krypton we are using Burrow version 0.0.1, meanwhile up...
[16:15:34] ok elukey i'm deploying el in beta
[16:16:54] hmm, elukey i'll rebase your change on head, and then deploy, so we keep your buffer stuff running
[16:18:21] i did reset your local edits though
[16:18:59] 10Analytics, 10Analytics-Wikistats: Wikistats for Wikidata lists several bots as normal users - https://phabricator.wikimedia.org/T59379#3958810 (10Liuxinyu970226)
[16:20:34] 10Analytics, 10Analytics-Wikimetrics, 10I18n: Add i18n support - https://phabricator.wikimedia.org/T60634#3958814 (10Liuxinyu970226)
[16:25:03] Heya elukey - do you have a minute for me?
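The sanitization model mforns walks through earlier (keep whitelisted fields verbatim, null everything else while building the sanitized table) can be sketched at record level; the schema name, whitelist contents, and field names below are invented for illustration:

```python
# Hypothetical whitelist: fields kept as-is per schema; any field not
# listed is set to null when refining into the sanitized table.
WHITELIST = {"SomeSchema": {"wiki", "event_count"}}


def sanitize(schema, record, whitelist=WHITELIST):
    """Return a copy of record with non-whitelisted fields nulled."""
    allowed = whitelist.get(schema, set())
    return {field: (value if field in allowed else None)
            for field, value in record.items()}


row = {"wiki": "eswiki", "event_count": 3, "ip": "10.0.0.1"}
sanitized = sanitize("SomeSchema", row)
# Removing a field from the whitelist later only affects new runs: the
# last 90 days can be re-sanitized from the source table, but older
# sanitized rows keep the old value because the source data is gone.
```

This is exactly the asymmetry discussed above: nulling a field is easy going forward, but irreversible-history decisions are baked in once the 90-day source window has passed.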
[16:29:33] ottomata: sorryyyy was in a meeting, please clean up all my mess don't worry
[16:29:39] joal: for you always
[16:29:42] :)
[16:29:48] batcave for a minute?
[16:29:58] hmm, i got beta messes too
[16:29:59] fdans: yt?
[16:30:07] beta puppet has jumbo/tls vk cherry-picked
[16:30:19] nuria_: hellooo
[16:30:21] which means i can't cherry-pick this el vk ip change, because i moved from role -> profile
[16:30:46] so i can either put a patch up that works for jumbo, or i can put a patch up that works for analytics, but not both!
[16:30:46] haha
[16:30:53] oh well, i'll just manually modify vk for now
[16:30:56] fdans: one more thing we need to make sure is that all map code is on the 2nd webpack bundle so it is not loaded on the dashboard
[16:31:02] fdans: but rather on the detail page
[16:31:06] fdans: makes sense?
[16:31:40] nuria_: yep, I got a patch with a few minor changes on the ui, will include that there
[16:31:58] joal: ack!
[16:32:01] (these are mostly changes that mforns suggested in an email)
[16:44:49] (03CR) 10Milimetric: [V: 032 C: 032] "nice, I like the logger approach, I think that means we have to change the cron command to only redirect stdout to the log" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/408930 (https://phabricator.wikimedia.org/T186541) (owner: 10Joal)
[17:01:13] ottomata: standup?
[17:03:13] (03PS1) 10Joal: Fix logging bug in sqoop-mediawiki-table script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/409379
[17:03:22] milimetric: --^
[17:04:32] (03CR) 10Milimetric: [V: 032 C: 032] Fix logging bug in sqoop-mediawiki-table script [analytics/refinery] - 10https://gerrit.wikimedia.org/r/409379 (owner: 10Joal)
[17:06:18] Today's XKCD is really fun :)
[17:06:22] https://xkcd.com/1953/
[17:07:44] git up
[17:08:46] 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: Set up (temporary) IPSec for Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T186598#3958871 (10Ottomata) a:05Ottomata>03elukey
[17:23:49] 10Analytics-EventLogging, 10Analytics-Kanban: Monitor and alert if no new data from JsonRefine jobs - https://phabricator.wikimedia.org/T186602#3958904 (10Ottomata) Current idea is to use Spark Accumulators to collect stats about jobs as they go, write them to a hive table, and then generate an alert email whe...
[17:24:25] 10Analytics-EventLogging, 10Analytics-Kanban: Hive EventLogging tables not updating since January 26 - https://phabricator.wikimedia.org/T186130#3958907 (10Ottomata) BTW, I believe this should be fixed, please tell me if otherwise. We'll add better monitoring in T186833
[17:48:29] fdans: the "filter" is still there on map page?
[17:48:31] fdans: https://stats.wikimedia.org/v2/#/es.wikipedia.org/reading/pageviews-by-country
[17:48:45] nuria_: that fix hasn't yet been deployed
[17:49:24] fdans: ok, let's wait for that one to be deployed to announce as i think we will get some ux bugs about it
[17:50:06] nuria_: my plan was to include it with the rest of minifixes I did yesterday + today
[17:50:31] fdans: ok, sounds good, let me know when you have a ptach
[17:50:35] *patch
[17:52:49] milimetric, elukey : the user/pw for piwik used to be in the stats machine
[17:53:22] milimetric, elukey : but i cannot find it any longer
[17:53:32] milimetric, elukey : do any of you know where it moved?
[17:54:06] there's a copy in my home folder
[17:54:08] lemme find it
[17:54:49] milimetric: k, thanks
[17:55:32] nuria: /home/milimetric/passwords on stat1005
[17:57:54] milimetric: super thanks
[19:10:20] 10Analytics: Pageviews/Stats on research.wikimedia.org - https://phabricator.wikimedia.org/T186819#3959151 (10Nuria)