[00:50:14] (03PS11) 10Nuria: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) [06:39:00] !log sto timers on an-coord1001 to facilitate daemon restarts (hive/oozie) [06:39:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:39:04] *stop [06:41:55] !log roll restart druid daemons for openjdk upgrades [06:41:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:44:20] 10Analytics, 10Operations, 10User-Elukey: Refactor Analytics POSIX groups in puppet to improve maintainability - https://phabricator.wikimedia.org/T246578 (10elukey) >>! In T246578#5933951, @nshahquinn-wmf wrote: > > I don't have any serious concerns, so you don't have to pay me too much attention; I joined... [07:02:48] joal: bonjouuurrr [07:03:20] qq to avoid pebcaks - there are some mediawiki history jobs still ongoing, are they dependent on hive? [07:03:49] just to avoid restarting daemons at the worst possible moment [07:04:00] (no rush, when you are online of course!) [07:30:09] 10Analytics, 10Gerrit, 10Gerrit-Privilege-Requests, 10User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (10abi_) @MarcoAurelio - Currently patches submitted by the l10n-bot (https://gerrit.wikimedia.org/r/c/analytics/wikistats2/+/576042) still require a m... [07:41:21] 10Analytics, 10Gerrit, 10Gerrit-Privilege-Requests, 10User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (10fdans) @abi_ thank you for all the help with this. I'm not sure how it works in other projects, but manually merging every patch seems prone to prob... [08:04:14] 10Analytics, 10Gerrit, 10Gerrit-Privilege-Requests, 10User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (10abi_) > If so we can close this task. 
To clarify, I meant {T240621} >>! In T245805#5935833, @fdans wrote: > @abi_ thank you for all the help with... [08:07:51] good morning [08:08:05] mediawiki_history is actually failing I think - triple checking [08:10:25] elukey: MWH is not using hive - you can restart the hive server :) [08:12:03] joal: ack! [08:12:18] one weird thing - history reduced failed due to druid being roll restarted (my bad) [08:12:32] so I re-ran the february coord (that had failed) [08:12:39] and now it seems waiting for a flag [08:12:47] https://hue.wikimedia.org/oozie/list_oozie_coordinator/0028061-191216160148723-oozie-oozi-C/ [08:12:51] normal ellu [08:12:53] normal elukey [08:13:15] elukey: snapshot 2020-02 is not yet available [08:13:36] joal: sure but why did it start before then? [08:14:05] elukey: flag for snapshot 2020-01 is there (last month snapshot) [08:14:22] I think I don't get it [08:14:36] 10Analytics, 10Analytics-Kanban, 10Analytics-Wikistats: Add multilanguage ability to Wikistats - https://phabricator.wikimedia.org/T238752 (10abi_) [08:14:38] 10Analytics, 10Analytics-Wikistats, 10translatewiki.net, 10Patch-For-Review: Add stats.wikimedia.org to translatewiki.net - https://phabricator.wikimedia.org/T240621 (10abi_) 05Open→03Resolved Going to close this task as this has been done and translations are being pushed out to the repo. There is an... [08:14:45] joal: so in the alerts I found a mail from oozie about "mediawiki-history-reduced-wf-2020-02" [08:14:55] elukey: you have not restarted the coordinator - ok [08:15:06] so I went in https://hue.wikimedia.org/oozie/list_oozie_coordinator/0028061-191216160148723-oozie-oozi-C/ and re-ran the failed job, for february [08:15:08] you have restarted the workflow for 2020-02 [08:15:11] I get it [08:15:14] exactly [08:16:38] I may be totally crazy [08:16:51] no no - something very bizarre is happening [08:18:40] in the meantime I am proceeding with hive/oozie ok?
[08:18:47] yessir [08:19:51] !log restart oozie/hive daemons on an-coord1001 for openjdk upgrades [08:19:52] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:22:29] !log hive metastore/server2 now running without zookeeper settings and without DBTokenStore (in memory one used instead, the default) [08:22:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:23:08] 10Analytics, 10Analytics-Kanban: Prevent detail sidebar from getting too wide - https://phabricator.wikimedia.org/T246744 (10fdans) [08:23:26] 10Analytics, 10Analytics-Kanban: Prevent detail sidebar from getting too wide - https://phabricator.wikimedia.org/T246744 (10fdans) p:05Triage→03High [08:29:12] 10Analytics: Specify in build command which languages to bundle - https://phabricator.wikimedia.org/T246745 (10fdans) [08:29:23] elukey: Interesting - mediawiki-history job failed at first attempt (I think it's driver-memory being the issue), and almost succeeded at second attempt (computed and wrote files) but failed before closing --> 3rd attempt started, overwriting existing files and making all other dependent jobs fail [08:30:06] !log Kill mediawiki-history-reduced as it is failing [08:30:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:30:27] !log Correct previous message: Kill mediawiki-history (not mediawiki-history-reduced) as it is failing [08:30:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:33:34] :* [08:33:37] err :( [08:34:12] !log re-enable timers on an-coord1001 after maintenance [08:34:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:36:00] !log roll restart kafka-jumbo for openjdk upgrades [08:36:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:36:37] 10Analytics, 10Analytics-Kanban: Make mediawiki-history spark jobs single-attempt -
https://phabricator.wikimedia.org/T246747 (10JAllemandou) [08:36:53] 10Analytics, 10Analytics-Kanban: Make mediawiki-history spark jobs single-attempt - https://phabricator.wikimedia.org/T246747 (10JAllemandou) a:03JAllemandou [08:37:11] (03PS1) 10Joal: Make mediawiki-history spark jobs single-attemps [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576275 (https://phabricator.wikimedia.org/T246747) [08:44:23] 10Analytics, 10Analytics-Kanban: Remove stats gathering for mediawiki_history production job - https://phabricator.wikimedia.org/T246748 (10JAllemandou) [08:44:37] 10Analytics, 10Analytics-Kanban: Remove stats gathering for mediawiki_history production job - https://phabricator.wikimedia.org/T246748 (10JAllemandou) a:03JAllemandou [08:44:57] (03PS1) 10Joal: Remove stats gathering in prod mediawiki_history [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576277 (https://phabricator.wikimedia.org/T246748) [08:45:25] elukey: could you please triple check the 2 CR above? [08:46:00] elukey: my plan is to manually run a job with the settings above proving it solves our issue, and let others review [08:47:48] (03CR) 10Elukey: [C: 03+1] Make mediawiki-history spark jobs single-attemps [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576275 (https://phabricator.wikimedia.org/T246747) (owner: 10Joal) [08:49:48] joal: one question - where does the --no-stats come from?
[08:50:32] elukey: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/mediawikihistory/MediawikiHistoryRunner.scala#L145 [08:50:46] also another thing that I want to do is to upgrade all spark actions in refinery and remove the common options with spark-defaults, since oozie now reads from the file [08:50:51] 10Analytics, 10affects-Kiwix-and-openZIM: "Month over month" i18n tag being mixed with locales - https://phabricator.wikimedia.org/T246750 (10fdans) [08:50:59] (03CR) 10Elukey: [C: 03+1] Remove stats gathering in prod mediawiki_history [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576277 (https://phabricator.wikimedia.org/T246748) (owner: 10Joal) [08:51:01] makes sense elukey [08:51:07] 10Analytics, 10Analytics-Kanban: "Month over month" i18n tag being mixed with locales - https://phabricator.wikimedia.org/T246750 (10fdans) [08:51:20] 10Analytics, 10Analytics-Kanban: "Month over month" i18n tag being mixed with locales - https://phabricator.wikimedia.org/T246750 (10fdans) a:03fdans [08:53:14] (03PS1) 10Fdans: Add the ability to manually select locales to build [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576278 (https://phabricator.wikimedia.org/T246745) [08:55:02] Hi awight - I have read your patch yesterday - it makes sense :) [08:56:48] joal: Ah thank you for taking a look. It times out like crazy, could that be related to the heavy load on stat1007, or is that machine separate from the hadoop cluster? [08:57:24] awight: the cluster was very busy yesterday (probably will be again today) - Beginning of month [08:57:34] awight: and possibly stat1007 was busy as well [08:58:27] awight: also, the join you're making is huge - it feels normal it takes time [08:58:38] Okay, good to know.
I need to run an explain to understand whether it's doing 3 scans of the mediawiki_history table, or just the 1 [08:58:50] awight: it'll do 3 scans for sure [08:58:57] /o\ [08:59:32] In that case, maybe I should add another step, which builds a list of all revisions that I want to pull, as a single set. [09:00:13] awight: I tried to do that, and it's also long - probably a good approach nonetheless [09:00:55] awight: I checked the conflicts data: 5k revisions - we can collect that and make a filter [09:01:09] neat. I'll do it just for my own education. The OOM exception suggests I might have bigger problems, though. [09:01:20] :S [09:01:29] awight: let me know how it moves :) [09:01:50] joal: I was wondering if that was possible--just filter for "rev_id in " rather than joining just to throw out the left-hand table rows. [09:02:03] I think I can figure that out, now that I know it's a thing :-) [09:04:33] awight: possible, but since you're after wiki_db + rev_id, I'd rather do a list of (rev_id = XX and wiki_db = YYY) OR ... [09:05:12] nice approach! [09:10:23] awight: https://gist.github.com/jobar/03551c18c287a700fe7637a74ffc843b [09:10:39] not tested --^ [09:13:42] oh that is funky! [09:14:30] have we ever explored https://hadoop.apache.org/docs/r2.8.5/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#ACLs_Access_Control_Lists ? [09:14:36] they look really neat [09:14:40] I don't think we have elukey [09:15:01] IIUC it would solve the issue of multiple groups having rwx access to the same files [09:15:23] there is one owner/group as always, but then an ACL with user/groups that can do rwx [09:15:38] ooooh [09:16:03] !log Manually restarting mediawiki-history-denormalize with new patch to try [09:16:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:16:59] I've decided that I don't need to care about base revision details older than a few days; that should help quite a bit. [09:17:15] awight: it will for sure!
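The filter joal sketches above, "(rev_id = XX and wiki_db = YYY) OR ...", can be generated mechanically from the ~5k conflict revisions. A minimal sketch; the table name, snapshot value, and revision pairs below are made up for illustration, only the `wiki_db`/`rev_id` columns come from the discussion:

```python
# Sketch: build an explicit "(wiki_db = X AND rev_id = N) OR ..." predicate
# from (wiki_db, rev_id) pairs, instead of joining just to discard rows.

def build_rev_filter(pairs):
    """Turn (wiki_db, rev_id) pairs into a SQL OR-clause."""
    clauses = [
        "(wiki_db = '{}' AND rev_id = {})".format(wiki_db, int(rev_id))
        for wiki_db, rev_id in pairs
    ]
    return " OR ".join(clauses)

# Illustrative pairs; in practice these would be the collected conflict revisions.
pairs = [("enwiki", 123), ("dewiki", 456)]
where = build_rev_filter(pairs)
query = (
    "SELECT wiki_db, rev_id, event_timestamp "
    "FROM wmf.mediawiki_history "
    "WHERE snapshot = '2020-01' AND ({})".format(where)
)
```

With only a few thousand pairs the resulting clause stays a manageable size, which is why collecting the keys first is viable here.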
[09:17:22] moar filtering! [09:18:09] ! Rerunning failed mediawiki-history jobs for 2020-02 after mediawiki-history-denormalize issue (mediawiki-history-dumps-coord, [09:19:36] (03PS2) 10Fdans: Add the ability to manually select locales to build [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576278 (https://phabricator.wikimedia.org/T246745) [09:20:01] (03PS1) 10Fdans: Fix i18n key mismatch - month over month [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576282 (https://phabricator.wikimedia.org/T246750) [09:20:39] 10Analytics: geoeditors-yearly job times out - https://phabricator.wikimedia.org/T246753 (10JAllemandou) [09:22:24] !log Rerunning failed mediawiki-history jobs for 2020-02 after mediawiki-history-denormalize issue [09:22:25] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:25:07] (03CR) 10Joal: Stop using the jar file in the WikidataArticlePlaceholderMetrics (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [09:26:26] I'm not sure whether it's less efficient to keep the join? https://gitlab.com/snippets/1946888 [09:27:34] awight: rethinking about that, I'd go for: get wiki_db+rev_id of base + other [09:28:13] awight: Extract base_parent wiki_db+rev_id (long, maybe we can use revision_create to read less data for this?) [09:28:47] awight: you then have your global filter by unioning the 3 (wiki_db+rev_id) datasets, and you can gather metadata [09:29:00] awight: the later join can be done locally I assume (a lot easier) [09:30:48] awight: does that sound reasonable? --^ [09:31:15] And you think I should build one giant OR clause to filter metadata rows? Or should I build a dataframe of the (wiki_db, rev_id) pairs and join? 
[09:32:01] I'd be interested to see which is faster :) [09:32:36] I think join would be (smaller query-plan, and local-join optimization), but who knows [09:32:40] awight: --^ [09:32:50] Let's try the join approach first [09:33:13] Thanks for taking the time to educate me about this paradigm... Hopefully the outcome is better wikis :-) Here are two mysteries which sent me down this rabbithole in the first place, T246439 T246440 [09:33:14] T246440: High proportion of edit conflicts seem to not involve a conflicting edit - https://phabricator.wikimedia.org/T246440 [09:33:14] T246439: High proportion of edit conflicts seem to come from new article creation - https://phabricator.wikimedia.org/T246439 [09:33:33] It's possible that > 2/3 of edit conflicts really shouldn't be happening! [09:34:21] wow awight - I have too many rabbit holes in concurrent exploration, so I won't jump into that one - but that is definitely appealing!!!! [09:34:30] +1 don't look now :-) [09:49:21] 10Analytics, 10Operations, 10Research, 10Traffic, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) [10:10:28] The internets say that Spark's saveAsTable creates tables which aren't compatible with Hive. That's unfortunate, if true. [10:11:33] very possible awight - we usually write files in parquet for instance with Spark, and then add hive partitions using oozie [10:14:39] (03PS1) 10Fdans: Prevent sidebar from widening too much [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576289 (https://phabricator.wikimedia.org/T246744) [10:14:56] 10Analytics, 10User-Elukey: Investigate Hadoop HDFS ACLs - https://phabricator.wikimedia.org/T246755 (10elukey) [10:15:47] 10Analytics, 10Operations, 10Research, 10Traffic, and 2 others: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10elukey) >>!
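The approach awight and joal settle on above, collect the (wiki_db, rev_id) keys once, union them, then keep only matching history rows, can be illustrated in plain Python. In Spark this would be a broadcast join against the small key set; everything below (data, field names) is invented for the sketch:

```python
# Sketch of "collect the keys once, then filter": union the key sets from
# the base/other/parent extractions, then keep only history rows whose
# (wiki_db, rev_id) appears in that union. Plain Python stands in for Spark.

def union_revision_keys(*datasets):
    """Union several iterables of (wiki_db, rev_id) into one filter set."""
    keys = set()
    for ds in datasets:
        keys.update(ds)
    return keys

def filter_history(rows, keys):
    """Keep only history rows whose (wiki_db, rev_id) is in the key set."""
    return [r for r in rows if (r["wiki_db"], r["rev_id"]) in keys]

# Invented stand-ins for the three extracted (wiki_db, rev_id) datasets.
base = [("enwiki", 1)]
other = [("enwiki", 2)]
parents = [("dewiki", 3)]
keys = union_revision_keys(base, other, parents)

rows = [
    {"wiki_db": "enwiki", "rev_id": 1, "comment": "keep"},
    {"wiki_db": "enwiki", "rev_id": 9, "comment": "drop"},
    {"wiki_db": "dewiki", "rev_id": 3, "comment": "keep"},
]
kept = filter_history(rows, keys)
```

The trade-off discussed in the log applies: a join keeps the query plan small and lets Spark broadcast the small side, while a giant OR clause grows the plan with the number of keys.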
In T245833#5934803, @leila wrote: > > @Miriam @elukey the layered permission system can have internal use-cases,... [10:31:23] ty, I'll follow that pattern then [10:42:22] 22 seconds \o/, and there's data written. Now to validate. [10:42:36] wow - this feels fast :) [10:50:33] !log restarted kafka jumbo (kafka + mirror maker) for openjdk upgrades [10:50:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:15:27] joal: after some hours of swearing, I realized that 'percent' and 'range' are reserved keywords from Hive 1.2.0 onward, so they need `` to work (otherwise hive will emit some weird syntax errors that are cryptic) [11:15:38] but, now refine works in hadoop test :D :D :D [11:15:43] \o/ [11:15:48] * elukey dances [11:16:04] * joal bows to elukey master of bigtop [11:18:48] I'll look for reserved keywords in refine and send a code review in case [11:19:13] `` should work with our hive version too, in theory [11:19:16] will also check that [11:19:31] joal: so far everything works [11:20:00] today I'll do another round of checks, then if you have patience/time to do a round of checks as well later on in the week [11:20:03] it would be massive [11:20:20] HDFS is still not finalized, so I can rollback to prev version [11:20:40] once we are confident that bigtop works, I'll roll back to cdh to see what explodes [11:20:44] then roll forward again [11:20:51] and then possibly finalize [11:20:58] what do you think? 
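The backtick fix elukey describes above (Hive treats `percent` and `range` as reserved keywords from 1.2.0 onward) can be sketched as a small quoting helper. The reserved-word set here is deliberately partial, and the table name is made up:

```python
# Sketch: backtick-quote column names that Hive reserves, as described
# above. Hive's real reserved-keyword list is much longer than this.
HIVE_RESERVED = {"percent", "range", "timestamp", "date"}

def quote_ident(name):
    """Backtick-quote an identifier if Hive reserves it."""
    return "`{}`".format(name) if name.lower() in HIVE_RESERVED else name

cols = ["event_id", "percent", "range"]
select = "SELECT {} FROM some_db.some_table".format(
    ", ".join(quote_ident(c) for c in cols)
)
```

Without the backticks, Hive rejects such queries with syntax errors that point nowhere near the offending column, which matches the "cryptic" errors mentioned in the log.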
[11:21:31] feels very possible elukey :) [11:21:52] I am still a bit skeptical of hue [11:22:09] but we'd need to package the last py3 upstream version anyway when ready [11:22:23] elukey: let's sync when you want to make sure I understand what you expect from me to test ) [11:31:40] joal: sure, in general I'd like your usual super comprehensive set of tests to understand what I missed :D [11:31:48] huhu :) [11:32:00] this time might require time, so please do it whenever you are good with other priorities [11:32:11] we have done a lot more than needed for bigtop up to now [11:32:16] not urgent [11:32:50] so looking forward to move to bigtop [11:32:51] :D [11:33:16] \o/ [11:33:36] I'm kinda busy now, but will keep the tests in mind elukey [11:40:47] * elukey lunch! [11:53:56] (03PS12) 10Joal: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [11:55:58] !log Kill actor-hourly tests [11:55:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:56:10] !log Kill actor-hourly oozie test jobs (precision of previous message) [11:56:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:14:21] (03PS13) 10Joal: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [12:31:32] (03PS14) 10Joal: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [13:27:17] (03PS3) 10Fdans: Add the ability to manually select locales to build [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576278 (https://phabricator.wikimedia.org/T246745) [13:44:52] (03PS4) 10Fdans: Add the ability to manually select locales to build [analytics/wikistats2] - 10https://gerrit.wikimedia.org/r/576278
(https://phabricator.wikimedia.org/T246745) [13:57:50] hello elukey ! i'm going to do LVS stuff for eventgate-analytics-external this morning! [13:57:54] want to do it with me? [13:58:23] ottomata: hello! I am about to roll restart Elastic Search in codfw with Gehel, sorry :( [13:58:27] ok! np [14:06:38] elukey, just saw your msg from yesterday (?) [14:07:23] dsaez: hello! nothing major, stat1007 was a little bit under pressure and there were java processes running under your user with a ton of ram used :) [14:07:50] I deployed a new change to systemd though that should automatically take care of these situations [14:07:53] (in theory) [14:08:02] because alarms are firing etc.. when the host is under pressure [14:11:56] got it...that might be spark, because I don't run other java process [14:12:37] I was having some RAM issues, so I've increased the master RAM allocation. Maybe was that, I'll decrease to avoid future problems. [14:20:44] 10Analytics: Use version of Lato that renders non-roman alphabets - https://phabricator.wikimedia.org/T246777 (10fdans) [14:23:22] 10Analytics: Drastically reduce build time for languages - https://phabricator.wikimedia.org/T246778 (10fdans) [14:29:56] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Event-Platform, and 6 others: Public EventGate instance and endpoint for analytics event intake: eventgate-analytics-external - https://phabricator.wikimedia.org/T233629 (10Ottomata) [15:02:36] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops, 10Patch-For-Review: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [15:13:00] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [15:19:22] (03PS2) 10Mforns: Add dimensions to druid pageview_hourly and virtualpageview_hourly 
[analytics/refinery] - 10https://gerrit.wikimedia.org/r/570681 (https://phabricator.wikimedia.org/T243090) [15:26:34] afk for a bit! [15:46:59] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (10Ottomata) [16:00:13] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (10Ottomata) [16:00:31] 10Analytics, 10Analytics-Cluster, 10Analytics-Kanban: Hadoop Hardware Orders FY2019-2020 - https://phabricator.wikimedia.org/T243521 (10Ottomata) [16:08:07] Hey, https://analytics.wikimedia.org/published/datasets/one-off/ladsgroup/ doesn't show the file I put in stat1007, how often the rsync happen? Should I manually trigger it? I can wait I just don't know for how long [16:09:59] Amir1: 30 minutes-ish? [16:10:09] cool [16:10:13] Thanks [16:18:59] 10Analytics: Errors on wikistats UI on console - https://phabricator.wikimedia.org/T246789 (10Nuria) [16:20:12] (03CR) 10Nuria: [C: 03+1] Stop using the jar file in the WikidataArticlePlaceholderMetrics (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [16:29:19] (03CR) 10Nuria: [C: 03+1] Remove stats gathering in prod mediawiki_history [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576277 (https://phabricator.wikimedia.org/T246748) (owner: 10Joal) [16:37:51] 10Analytics, 10Gerrit, 10Gerrit-Privilege-Requests, 10User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (10MarcoAurelio) >>! In T245805#5935811, @abi_ wrote: > @MarcoAurelio - Currently patches submitted by the l10n-bot (https://gerrit.wikimedia.org/r/c/a... 
[16:37:59] !log restarted turnilo to refresh deleted test datasource [16:38:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:38:04] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Patch-For-Review, 10Services (watching): Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 (10Pchelolo) [16:38:46] ottomata: staff now deployed to test wiki: [16:38:50] https://www.irccloud.com/pastebin/sS8hlHX4/ [16:39:36] OH [16:39:39] was it just cached?!? [16:39:40] ottomata: where can we see what that event gate instance is ingesting? [16:39:46] my change hasn't been deployed afaik [16:39:50] ottomata: no idea [16:39:57] hang on, in thhe middle of somethign with petr [16:40:02] ottomat: k [16:44:18] 10Analytics, 10Gerrit, 10Gerrit-Privilege-Requests, 10Patch-For-Review, 10User-MarcoAurelio: Give access to Wikistats 2 to l10n-bot - https://phabricator.wikimedia.org/T245805 (10MarcoAurelio) 05Resolved→03Open Reopening due to new patch for review. [16:50:16] nuria: we shoudl be able to consume from kafka [16:50:34] ottomata: kafka labs? [16:50:56] no in prod, the logstash kafkas [16:51:49] ottomata: ah wait, testwiki is on prod [16:51:51] ottomata: maybe you needed to sync twice? T236104 [16:51:51] T236104: Cache of wmf-config/InitialiseSettings often 1 step behind - https://phabricator.wikimedia.org/T236104 [16:52:02] Amir1: ha maybee. [16:52:52] if you deploy anything with IS.php you need to sync twice. This has the potential to bring down everything. 
Sent some emails about it already [16:56:50] nuria: i'm consuming from that kafka topic [16:56:54] not sure how to trigger an error [16:57:13] we should also be able to see an outgoing network request to the URL [17:19:37] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10serviceops: Create production and canary releases for existent eventgate helmfile services - https://phabricator.wikimedia.org/T245203 (10Ottomata) [17:19:52] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Patch-For-Review, 10Services (watching): Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 (10Ottomata) [17:20:33] 10Analytics, 10Analytics-EventLogging, 10Event-Platform, 10Patch-For-Review, 10Services (watching): Switch all eventgate clients to use new TLS port - https://phabricator.wikimedia.org/T242224 (10Ottomata) [17:29:14] ottomata: * i think* this should work to send errors: https://gist.github.com/nuria/0255482f674c2a9f7449ac21b14320df [17:29:36] ottomata: i might be missing something big time [17:29:48] but that certainly puts errors on the queue [17:30:21] ottomata: and mw events is just subscribed to that: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/master/modules/ext.wikimediaEvents/clientError.js#L28 [17:31:57] (03CR) 10Nuria: [C: 03+1] Classification of actors for bot detection (0311 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [17:32:50] hm interesting, doesn't seem to do much... [17:33:04] hip: any idea how to trigger a client error on test.wikipedia.org [17:33:04] ?
[17:35:33] ottomata: let me see, let's try to send a plain event just to make sure the intake endpoint can ingest it [17:35:38] ottomata: after staff [17:35:40] k [17:36:20] nuria: i'm sure the endpoint works, i can post from CLI via curl [18:13:02] 10Analytics: Add historical page protection status to MediaWiki history - https://phabricator.wikimedia.org/T246723 (10Milimetric) Thanks @Halfak, this is great, I knew about page protections but the linked paper is useful for details. We'll triage Thursday [18:19:42] 10Analytics, 10Operations: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) 05Open→03Resolved [18:19:45] 10Analytics, 10Operations, 10hardware-requests: Refresh or replace oxygen - https://phabricator.wikimedia.org/T181264 (10RobH) [18:19:50] Hello. rsync between notebooks and stats machines is not working [18:21:48] djellel: yep people are using too much disk space [18:22:06] elukey: where ? [18:22:21] djellel: ah sorry I misread! [18:22:33] between stat and notebooks [18:22:51] there are alarms for disk space usage on notebooks, so I made the connection (wrong one) [18:23:05] yes you are right for the moment rsync works only from a stat to a stat [18:23:12] yes, I need to move my env out of notebooks [18:23:32] we could change it, not sure why we decided to avoid this in the first place [18:23:46] ottomata: anything against allowing rsync between stat and notebooks? [18:24:03] elukey: nopers, i thought it already was allowed [18:24:51] it is yes, I just checked on stat1007 [18:24:57] I was convinced otherwise [18:25:00] stat1005 [18:25:20] should be the same, checking [18:25:33] yeah I checked /etc/rsync.d/frag-home [18:25:36] notebooks are there [18:25:46] djellel: what command are you using?
[18:26:21] IIRC yes [18:26:27] https://www.irccloud.com/pastebin/Ytuqjk3M/ [18:27:47] ahhh yes [18:27:55] ottomata: on notebooks rsync doesn't allow pull from stat [18:28:00] but on stats, it is allowed [18:28:15] huh [18:28:17] why not? [18:28:17] hm [18:28:21] hosts allow = notebook*.*.wmnet localhost [18:28:23] no idea [18:28:28] it must be a puppet thing [18:28:45] there is a profile::swap::rsync_hosts_allow [18:28:49] we can add stat boxes [18:29:13] sending cr [18:30:18] djellel: give us 5 mins :) [18:31:24] ottomata: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576409/ [18:32:35] +1 [18:43:25] djellel: can you retry ? [18:51:07] (03CR) 10Milimetric: [V: 03+2 C: 03+2] Add wikidata item_page_link oozie job [analytics/refinery] - 10https://gerrit.wikimedia.org/r/572834 (https://phabricator.wikimedia.org/T244707) (owner: 10Joal) [18:52:31] djellel: going afk for now but will read later :) [18:52:34] * elukey off! [18:56:20] (03PS8) 10Milimetric: Add wikidata item_page_link spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572746 (https://phabricator.wikimedia.org/T244707) (owner: 10Joal) [18:56:41] (03CR) 10Milimetric: [C: 03+2] Add wikidata item_page_link spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572746 (https://phabricator.wikimedia.org/T244707) (owner: 10Joal) [19:01:39] (03Merged) 10jenkins-bot: Add wikidata item_page_link spark job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572746 (https://phabricator.wikimedia.org/T244707) (owner: 10Joal) [19:11:30] (03PS1) 10Milimetric: Fix bad edit_hourly query [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576420 [19:12:55] Hi team [19:13:06] Thanks milimetric for the merges :) [19:13:16] milimetric: shall I merge the edit-hourly one? [19:14:01] joal: I was just looking and I'm confused the edit_hourly coord was running until a second ago and it shows killed now [19:14:07] but it said it was waiting not failed... 
[19:14:09] elukey: still not, just tried [19:14:17] meh :( [19:14:49] ok milimetric - mwh-denormalized-corrected finished, so edit-hourly ran [19:15:49] joal: ah! Ok, so I just tested the new query, it works fine, should be ok to merge and deploy with either train or otherwise. I'll add it to the etherpad nonetheless [19:16:04] ack milimetric - will merge it now [19:16:14] (03CR) 10Joal: [V: 03+2 C: 03+2] "Indeed! Sorry for not catching that in previous patch :(" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576420 (owner: 10Milimetric) [19:17:08] nuria: shall we try and log an error? [19:20:02] (03CR) 10Joal: Add wikidata item_page_link spark job (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572746 (https://phabricator.wikimedia.org/T244707) (owner: 10Joal) [19:26:28] fdans: do you mind if I tchu-tchu today? [19:29:36] (03PS2) 10Joal: Fix WikidataArticlePlaceholderMetrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [19:31:31] (03PS3) 10Joal: Fix WikidataArticlePlaceholderMetrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [19:31:52] (03PS1) 10Joal: Move wikidata jobs in the wikidata package [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576424 [19:34:23] ottomata: ya, sorry, i missed this before [19:35:01] nuria: aye ok [19:35:08] i'm not totally sure what to try [19:35:12] (03CR) 10Nuria: [C: 03+2] Fix WikidataArticlePlaceholderMetrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [19:36:01] joal: i merged the wikidata one as well, right?
(i was the one holding the cr) [19:36:05] Thanks nuria :) [19:36:12] ottomata: ok, let's first try to post plainly to the endpoint [19:36:17] ottomata: without the client [19:36:29] ottomata: do you have a test case handy that matches the error schema [19:36:36] nuria: I'm also taking advantage of the work in that space to move the 2 jobs left to the wikidata package (see https://gerrit.wikimedia.org/r/572746 (https://phabricator.wikimedia.org/T244707) (owner: Joal) [19:36:41] 20:26:28 < joal> fdans: do you mind if I tchu-tchu today? [19:36:45] mwrf --^ [19:36:46] sorry [19:37:03] so we post to the *intake endpoint, makes sense? [19:37:05] nuria: if ok for you I'll just self-merge and bump in jobs [19:37:15] (03CR) 10Nuria: [V: 03+2 C: 03+2] Fix WikidataArticlePlaceholderMetrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [19:37:21] \o/ [19:37:33] joal: that one i meant to merge, sorry [19:37:39] np [19:37:51] nuria: yes [19:37:56] we can do that very easily from CLI [19:38:00] ottomata: can you cut & paste? [19:38:06] ya [19:38:14] nuria: last thing - ok for me to merge bots? [19:38:24] joal: OUR BOTS? [19:38:27] nuria: we'll see how to handle disk space [19:38:29] sure :) [19:38:40] nuria: YOUR bots :) [19:38:51] joal: yes, but all tables need to be redone and data rerun, the one there now [19:38:52] joal: yeah we agreed that we would move the train to tue no? Just don’t deploy wikistats pls [19:38:55] joal: has a bug [19:39:25] fdans: ok, train-ing tonight - thanks - no wikistats, I'll do it when you tell me to :) [19:39:46] nuria: no bug AFAIK, which are you thinking of?
[19:40:22] (03Merged) 10jenkins-bot: Fix WikidataArticlePlaceholderMetrics [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/572734 (https://phabricator.wikimedia.org/T236895) (owner: 10Ladsgroup) [19:40:27] joal: yes, the min and max are transposed, added instructions here: https://etherpad.wikimedia.org/p/analytics-weekly-train [19:40:49] nuria: right - I kept data, but I am recomputing since 2020-01-01 [19:40:54] HMMMM [19:41:00] joal: did you see mediawiki-geoeditors-monthly-wf [19:41:05] it failed due to the UDF [19:41:08] joal: ah ya, that too [19:41:10] nuria: I'll check how much we have [19:41:27] milimetric: failed again? [19:41:41] milimetric: that does not make sense [19:41:42] 13 hours ago, luca emailed, wasn't sure if you saw it [19:41:53] milimetric: does it have the latest jar version? [19:42:06] ah [19:42:07] hm [19:42:13] not sure... [19:42:16] ottomata: found an example? [19:42:20] ya [19:42:21] curl -v -H 'Content-Type: text/plain' -d'{"$schema": "/mediawiki/client/error/1.0.0", "meta": {"stream": "mediawiki.client.error"}, "message": "test event", "type": "TEST", "url": "http://otto-test.org", "user_agent": "otto test"}' 'https://intake-logging.wikimedia.org/v1/events?hasty=true' [19:42:31] sorry, had the Content-Type wrong for a min and was confused [19:42:37] that is the Content-Type that sendBeacon will use [19:42:42] milimetric: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0024956-191216160148723-oozie-oozi-C/ [19:43:16] nuria: you should be able to paste that into CLI on your laptop and get events in [19:43:25] i'm tailing kafka topic now, so if you do I will see it [19:43:26] milimetric: that's what I thought - error said UDF, but the real issue was actually that data wasn't available anymore [19:44:48] ottomata: k, will send a few, this is kafka main, correct? [19:44:57] no [19:44:59] kafka logging [19:45:04] different kafka cluster [19:45:08] e.g.
logstash1010 [19:45:10] not sure if you have access there [19:45:13] joal: oh ok, so you fixed it. I get confused trying to join the irc backscroll to the emails... we should probably standardize on how we handle these things [19:45:28] ah i see, but i should see a topic with some messages in grafana in a few mins at least [19:45:30] agreed milimetric - I should have made it clearer [19:45:35] yes [19:45:37] i guess ya [19:45:56] also here [19:45:58] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-logging-external&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All [19:46:47] or in Kafka by Topic [19:46:47] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.client.error&from=1583263902123&to=1583264802123 [19:46:55] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576277 (https://phabricator.wikimedia.org/T246748) (owner: 10Joal) [19:47:19] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merge for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576275 (https://phabricator.wikimedia.org/T246747) (owner: 10Joal) [19:48:04] (03PS1) 10Joal: Bump wikidata jobs jar version and update package [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576430 [19:48:38] (03CR) 10Joal: [C: 03+2] "Merging for deploy" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576424 (owner: 10Joal) [19:49:03] ottomata: ok, so this test works just fine, we can post to the intake endpoint, let's try to do it from the mw client [19:50:03] aye that is what I don't know how to do :) [19:50:07] mw.track i guess isn't working right? there?
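For readers following along, the curl test above can be reproduced from Node as a sketch. The payload mirrors the example in the log; `buildClientErrorEvent` is a hypothetical helper for illustration, not part of WikimediaEvents, and the actual POST is left commented out since it needs network access to the intake endpoint.

```javascript
// Sketch: build the same test event the curl command above posts.
// buildClientErrorEvent is a hypothetical helper; the field set is
// copied from the example in the log.
function buildClientErrorEvent( message, url, userAgent ) {
	return {
		$schema: '/mediawiki/client/error/1.0.0',
		meta: { stream: 'mediawiki.client.error' },
		message: message,
		type: 'TEST',
		url: url,
		user_agent: userAgent
	};
}

// Serialized exactly as curl's -d body; sendBeacon posts this with
// Content-Type: text/plain, which is why the curl test sets that header.
const body = JSON.stringify(
	buildClientErrorEvent( 'test event', 'http://otto-test.org', 'otto test' )
);

// To actually post (requires network access):
// fetch( 'https://intake-logging.wikimedia.org/v1/events', {
//     method: 'POST',
//     headers: { 'Content-Type': 'text/plain' },
//     body: body
// } );
console.log( body );
```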
[19:52:33] (03Merged) 10jenkins-bot: Move wikidata jobs in the wikidata package [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576424 (owner: 10Joal) [19:54:16] nuria: did you POST that event from CLI? [19:54:16] (03PS1) 10Joal: Update changelog.md to v0.0.117 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576432 [19:54:25] milimetric: if you have a minute --^ [19:54:26] ottomata: ya, i posted like 10 or so [19:54:35] hmmm i don't see them [19:54:46] ottomata: on the topic? [19:54:49] yeah [19:54:54] remove the ?hasty=true bit [19:55:01] what is your http response status code? [19:55:09] (03CR) 10Milimetric: [C: 03+2] Update changelog.md to v0.0.117 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576432 (owner: 10Joal) [19:55:47] Thanks :) [19:55:49] (03PS15) 10Joal: Classification of actors for bot detection [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [19:56:28] ottomata: i see messages on topic on the dashboard you sent [19:56:36] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/562368 (https://phabricator.wikimedia.org/T238361) (owner: 10Nuria) [19:57:20] ottomata: the ones i am sending from cli [19:57:38] ottomata: i am trying to find wikimediaevents on test wiki to debug the send beacon call there [20:00:10] (03Merged) 10jenkins-bot: Update changelog.md to v0.0.117 [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576432 (owner: 10Joal) [20:00:58] ottomata: for the life of me i cannot find the wikimedia events code that i was able to find yesterday [20:01:00] !log Release refinery-source v0.0.117 with Jenkins [20:01:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:01:31] you mean in the browser?
[20:01:41] ottomata: ya [20:01:46] nuria i have seen some messages too [20:01:53] can you do it right now so I am watching when you hit send [20:01:57] ottomata: cause from the cli we can get things on the topic [20:02:01] also do it without the ?hasty=true [20:02:03] bit on the URL [20:02:09] ottomata: just did [20:02:13] ottomata: k [20:02:22] you should get a relevant response code [20:02:23] 201 [20:02:25] without hasty=true [20:02:49] ottomata: yes [20:02:55] you got 201??? [20:03:01] why don't I see an event.... [20:03:03] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/576430 (owner: 10Joal) [20:03:18] ottomata: [20:03:21] https://www.irccloud.com/pastebin/ZlXfTdBF/ [20:03:44] hehe, nuria can you change one of the fields to have your name in it instead of mine? [20:04:14] this then [20:04:15] curl -v -H 'Content-Type: text/plain' -d'{"$schema": "/mediawiki/client/error/1.0.0", "meta": {"stream": "mediawiki.client.error"}, "message": "NURIA EVENT", "type": "TEST", "url": "http://nuria-test.org", "user_agent": "nuria test"}' 'https://intake-logging.wikimedia.org/v1/events' [20:04:17] just do ^ [20:04:35] ottomata: just sent some with "queen of napkin math" [20:04:46] yeah don't see them!!!! [20:04:59] ottomata: well, we found problem number 1 [20:05:04] but mine work.... [20:05:06] why??? [20:05:47] ottomata: they gotta be working cause i get 201s [20:05:53] ottomata: maybe they are not valid [20:07:22] no you wouldn't get 201 [20:08:21] ok nuria can you do the exact same command [20:08:28] but from deploy1001.eqiad.wmnet [20:08:28] ?
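Context for the "you got 201???" exchange: from memory of the EventGate README (treat the exact codes as an assumption to verify against the docs), the intake distinguishes full success, hasty acceptance, and partial or total validation failure by status code, which is why a 201 without `?hasty=true` should mean the events really were produced. A small sketch of that mapping:

```javascript
// Rough map of EventGate intake response codes, from memory of the
// EventGate docs -- an assumption to verify, not an authoritative list.
function describeEventGateStatus( status ) {
	switch ( status ) {
		case 201: return 'all events accepted and produced';
		case 202: return 'accepted hastily (?hasty=true), no validation result returned';
		case 207: return 'some events invalid; check the response body for details';
		case 400: return 'all events invalid';
		default: return 'unexpected status ' + status;
	}
}

console.log( describeEventGateStatus( 201 ) );
```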
[20:10:03] also, maybe we should batcave [20:10:07] so you can share your screen [20:10:09] something is weird [20:10:24] k [20:10:27] bc [20:14:36] 10Quarry, 10observability: Develop the monitoring of Quarry - https://phabricator.wikimedia.org/T205150 (10Framawiki) Just a quick note, we should really have some metrics, including query error rate. There are user concerns today about it, but we don't have any simple way to get numbers to see eventual inciden... [20:28:32] !log Add new jars to refinery using Jenkins [20:28:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:34:00] !log Deploy refinery using scap [20:34:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:42:58] * mforns getting internet cuts because of strong winds :[ [20:46:25] !log Deploy refinery onto HDFS [20:46:26] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:04:09] !log Start wikidata_item_page_link coordinator [21:04:10] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:06:00] !log Kill/restart edit_hourly job [21:06:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:07:12] !log Start Wikidataplaceholder job [21:07:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:07:18] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Desktop Improvements, and 7 others: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 (10Ottomata) Hm, ok `navigator.sendBeacon` from the browser to intake-logging.wikimedia.org works. On test.wikiped...
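Ottomata's T246030 update above confirms `navigator.sendBeacon` reaches intake-logging.wikimedia.org. A minimal sketch of that browser-side call follows; `logError` is a hypothetical wrapper, not the actual WikimediaEvents implementation, and the beacon function is injected so the sketch can run (and be inspected) outside a browser, where you would pass `navigator.sendBeacon.bind( navigator )`.

```javascript
// Sketch of the browser-side error intake call. logError is hypothetical;
// sendBeacon posts the body with Content-Type: text/plain, matching the
// curl tests earlier in the log.
function logError( beacon, intakeUrl, event ) {
	return beacon( intakeUrl, JSON.stringify( event ) );
}

// Outside a browser there is no navigator, so inject a stub that records
// what would have been sent.
const sent = [];
const stubBeacon = ( url, body ) => {
	sent.push( { url: url, body: body } );
	return true;
};

logError( stubBeacon, 'https://intake-logging.wikimedia.org/v1/events', {
	$schema: '/mediawiki/client/error/1.0.0',
	meta: { stream: 'mediawiki.client.error' },
	message: 'sendBeacon sketch'
} );
console.log( sent[ 0 ].url );
```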
[21:08:29] !log Kill restart wikidata-specialentitydata_metrics-coord [21:08:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:10:56] !log Kill Wikidataplaceholder failing coord [21:10:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:14:36] (03PS1) 10Joal: Fix wikidata article-placeholder job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576449 (https://phabricator.wikimedia.org/T236895) [21:16:00] !log Kill-restart mediawiki-history job [21:16:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:17:13] !log kill-restart mediawiki-history-check_denormalize-coord [21:17:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:19:32] !log Kill-restart actor jobs [21:19:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:21:53] ok team - deploy done, still 4 jobs to kill-restart (currently in flight so it'll wait until tomorrow morning), and 1 failed (see patch above) [21:21:58] Gone for tonight :) [21:43:18] 10Analytics, 10Analytics-Kanban, 10Better Use Of Data, 10Desktop Improvements, and 7 others: Enable client side error logging in prod for small wiki - https://phabricator.wikimedia.org/T246030 (10Tgr) Works for me as expected: `lang=javascript > mw.trackSubscribe( 'global.error', (...args) => console.log(... [22:14:08] 10Analytics: Requesting Kerberos authentication principal for Michael Holloway - https://phabricator.wikimedia.org/T246834 (10Mholloway) [22:41:30] 10Analytics, 10Event-Platform, 10Product-Analytics, 10CPT Initiatives (Modern Event Platform (TEC2)): Eventbus revisions are duplicated in event.mediawiki_revision_tags_change - https://phabricator.wikimedia.org/T218246 (10nettrom_WMF) I came across this issue when looking to use `mediawiki_revision_tags_c...
[23:02:55] (03CR) 10Ladsgroup: [C: 03+1] Fix wikidata article-placeholder job [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/576449 (https://phabricator.wikimedia.org/T236895) (owner: 10Joal)