[05:21:02] Analytics-Clusters, Analytics-Radar: Requesting account expiration extension - https://phabricator.wikimedia.org/T183291 (Nuria) To keep archives happy, WMF did the work of productionizing these scripts: https://wikitech.wikimedia.org/wiki/Analytics/Data_quality/Traffic_per_city_entropy
[05:49:17] Analytics, Anti-Harassment, Event-Platform: SpecialInvestigate Event Platform Migration - https://phabricator.wikimedia.org/T267349 (Niharika) >>! In T267349#6606755, @Ottomata wrote: > @Niharika > Let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of...
[05:51:19] Analytics, Anti-Harassment, Event-Platform: CookieBlock Event Platform Migration - https://phabricator.wikimedia.org/T267341 (Niharika) >>! In T267341#6606747, @Ottomata wrote: > @Niharika > Let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of this...
[06:03:56] Analytics, Anti-Harassment, Event-Platform: SpecialMuteSubmit Event Platform Migration - https://phabricator.wikimedia.org/T267350 (Niharika) >>! In T267350#6606756, @Ottomata wrote: > @Niharika > Let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of...
[06:06:39] Analytics, Anti-Harassment, Event-Platform: AutoblockIpBlock Event Platform Migration - https://phabricator.wikimedia.org/T267340 (Niharika) >>! In T267340#6606746, @Ottomata wrote: > @Niharika > Let us know if this schema needs client IP and/or geocoded data? If not, it will be removed as part of...
[06:31:42] Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (elukey) >>! In T268074#6628851, @razzi wrote: > Questions: > > - new zookeeper cluster or reuse a zookeeper cluster? A possibility that we discussed on IRC could be to co-locate zookeeper on the nodes, like w...
[07:26:07] hello people, I am going to the dentist now, talk with you later on!
[07:31:17] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Xover) Just to add a perspective… A guarantee that data is always sanitised is an i...
[09:22:44] !log set dns_canonicalize_hostname = false to all kerberos clients
[09:22:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:24:07] Hi elukey - I'm sorry to have missed your prez yesterday :(
[09:24:44] joal: bonjour :) please don't say that, it is fine, you had other things to do :)
[09:24:46] elukey: I'm also sorry not to have provided my tests for kerberos-hive-oozie in a timely manner
[09:25:16] elukey: I have seen you solved that with Marcel - That's good - But I should have been keeping my word
[09:26:43] joal: that is my bad too, you were not on-call, I asked to Marcelo instead :)
[09:26:59] everything looks good! Browser general is running fine, so I think we can proceed with moar tests
[09:27:14] I am not merging the change for kerberos clients to avoid resolving cnames
[09:27:31] \o/
[09:27:43] This is great :)
[09:27:45] should not impact us but we'll see
[09:28:13] yesterday the team had some doubts about the new db replication on an-coord1002, so I'll try to do some extra research
[09:29:07] hm - what kind of doubts?
[09:30:05] in theory the replica on db1108 (where we get backups) and the one on an-coord1002 (where we failover if needed) can be out of sync
[09:30:34] so say that db1108 is ahead compared to 1002, when we set it as replica of 1002 (after a failover) it will complain
[09:31:17] but IIUC this can be fixed dumping the extra bits from 1108 via binary logs, and load them on 1002
[09:31:33] if 1108 is lagging compared to 1002 then no problem
[09:32:14] it should be fine but not everybody was convinced about the solution, so we'll see :)
[09:32:18] elukey: Would MariaDB have ways to handle this synchro issue?
[09:32:36] elukey: This seems like a usual multi-follower problem
[09:32:57] joal: I am going to follow up with Manuel on this, I think it should be a known use case indeed, but I want to be sure
[09:33:08] ack, makes sense :)
[09:33:24] yesterday I spent a ton of time to bootstrap the replica on 1002, all time that we can avoid if 1001 fails now
[09:34:01] but of course it is something that we'll evolve over time
[09:34:08] especially after failover tests
[09:35:06] yup - I definitely see the value of that work elukey - As usual, thank you for making our platform more robust <3
[09:35:39] <3 thanks
[09:36:12] With your permission, I'll ask some of your time early afternoon to review my prez if feasible
[09:36:15] elukey: --^
[09:36:28] elukey: can also be later in the afternoon if you're training :)
[09:38:26] joal: so I have workers at home around our 15:00, if you want we can do late morning? (even now if you are good)
[09:47:41] I have also asked in the fundraising irc chan if they are ok to let us upgrade hadoop in december
[09:47:49] if so, we might have a shot :)
[09:51:31] upgrading in December would be great!
[09:51:45] elukey: now is not easy, I'm with kids :)
[09:52:27] elukey: I'll ping you when they are in bed (probably around 13:30)
[09:52:51] ah ok at that time I'll be afk!
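A minimal sketch of the check behind the 09:30-09:31 messages: given how far each replica has applied the same primary's binary log, work out which one is ahead. It assumes both replicas report their position in the primary's binlog (the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields of MariaDB's `SHOW SLAVE STATUS`) and that binlog file names end in a numeric sequence suffix. The class, the function names and the sample file names are hypothetical, not taken from the cluster.

```python
from typing import NamedTuple, Optional


class ReplicaPosition(NamedTuple):
    """How far a replica has executed, expressed in the primary's binlog."""
    host: str
    master_log_file: str       # e.g. "primary-bin.000042" (hypothetical name)
    exec_master_log_pos: int   # byte offset inside that binlog file


def binlog_sequence(log_file: str) -> int:
    """Return the numeric suffix of a binlog file name such as 'x-bin.000042'."""
    return int(log_file.rsplit(".", 1)[1])


def replica_ahead(a: ReplicaPosition, b: ReplicaPosition) -> Optional[ReplicaPosition]:
    """Return the replica that has applied more of the primary's binlog, or None if equal."""
    key_a = (binlog_sequence(a.master_log_file), a.exec_master_log_pos)
    key_b = (binlog_sequence(b.master_log_file), b.exec_master_log_pos)
    if key_a == key_b:
        return None
    return a if key_a > key_b else b
```

If the backup replica comes out ahead of the failover target, that is the case where the extra events would have to be replayed from binary logs before repointing; if it lags, no extra step is needed, as said in the chat.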
[10:06:21] if it is ok we can do it before the pres
[10:36:40] Analytics, Better Use Of Data, Event-Platform, Product-Infrastructure-Data, Documentation: [MEP Client Library] Write User-facing Documentation - https://phabricator.wikimedia.org/T267408 (Aklapper) Did you maybe mean mediawiki.org and wikitech.wikimedia.org? Is this about #Event-Platform, or...
[10:43:52] Morning
[10:45:49] hola
[10:53:36] I just accidentally nuked half of my work from yesterday :(
[10:53:44] Oh well, rewriting it will clearly make it better
[11:01:39] ouch :(
[11:12:02] It builds character :D
[11:25:33] Analytics, Product-Analytics, Inuka-Team (Kanban): Set up preview counting for KaiOS app - https://phabricator.wikimedia.org/T244548 (AMuigai)
[11:42:57] * elukey afk! lunch
[13:07:31] hellooo
[13:55:28] fdans: hola!
[14:24:34] helloo teammm
[14:27:30] Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (Ottomata) @elukey, what ZK does HA failover use in the analytics-test-hadoop cluster? Also, what nodes should we build this on? If we do this work, I'd prefer to do something permanent, a kafka jumbo-test-eqi...
[14:40:30] Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (elukey) >>! In T268074#6630532, @Ottomata wrote: > @elukey, what ZK does HA failover use in the analytics-test-hadoop cluster? an-conf100[1-3], our cluster! > Also, what nodes should we build this on? If we...
[14:42:24] mforns: holaaaa
[14:42:32] heyaaa
[14:42:45] I have another thing for you when you have time, so your ops week is not boring :D
[14:43:05] https://phabricator.wikimedia.org/T265971
[14:45:16] Analytics-Clusters: Create kafka-jumbo mirror cluster - https://phabricator.wikimedia.org/T268074 (Ottomata) Ok, let's not call this kafka jumbo-test-eqiad then. I think just test-eqiad is best, and we can use the cluster at whim for various upgrades, etc.
> I don't think that on Ganeti there is a lot of...
[14:50:15] reading
[14:51:07] ok, cool, will do!
[14:51:19] mforns: so I copied stuff over to stat1004 already
[14:52:31] * elukey afk for a bit, workers at home
[15:02:16] heya ottomata I'm wondering about the canary errors, if there's sth I can do there as part of ops week, or it's already taken care of?
[15:03:02] mforns: not much you can do
[15:03:02] i
[15:03:05] i'm working on it in https://phabricator.wikimedia.org/T266573
[15:03:16] about to deploy something that may or may not help now
[15:03:26] ok
[15:03:29] yesterday
[15:03:42] i learned that the MediaWiki action API will return HTTP 200 almost always
[15:03:48] even if there are errors
[15:03:53] O.o
[15:04:03] https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/641566/4/lib/stream-configs.js
[15:04:04] that's why then
[15:04:20] could be
[15:06:19] post +1
[15:15:02] elukey: I've seen you're away - I'm interested if you have a minute before 5pm :)
[15:15:42] joal: I am free now
[15:15:45] !!
[15:15:47] \o/
[15:15:56] batave?
[15:16:55] joining
[15:25:56] ottomata: do you know if you can add the --files arg to spark_opts more than once? Meaning: --files /some/file,/some/other/file --master yarn --deploy-mode cluster --files /yet/another/file ?
[15:27:20] I'm trying to test that now, but just checking if you know it by chance :]
[15:27:28] hm, i don't know!
[15:28:16] ok, no problemo!
[15:33:31] ottomata: mmm, spark2-shell does not support it, it only considers the last instance of --files arg.
So I assume spark2-submit is the same
[16:01:08] Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (RobH)
[16:07:49] Analytics-Radar, Product-Analytics: Content for analytics.wikimedia.org - https://phabricator.wikimedia.org/T267254 (mpopov) I think T267251 would be great to have as part of this
[16:09:52] Analytics, Product-Infrastructure-Data, Wikimedia-Logstash, observability: Create a separate logstash ElasticSearch index for schemaed events - https://phabricator.wikimedia.org/T265938 (CDanis) Quick question: when the time comes, will it be possible to dump all the old NEL events out of the exi...
[19:51:53] Analytics, Analytics-Kanban, Better Use Of Data, Event-Platform, and 3 others: Clients need to generate an ISO 8601 formatted timestamp - https://phabricator.wikimedia.org/T240460 (Ottomata) FYI @nettrom_WMF, I just deployed the change today to make legacy EventLogging events have `dt` be a serve...
[19:54:14] Gone for tonight team
[19:54:37] sorry mforns was doing a deploy
[19:54:43] not sure, buut the change is not what I expected!
[19:54:56] it is sticking the spark_extra_files into the job config file, instead of the CLI!
[19:55:00] looking now
[19:55:32] oh i know
[20:02:19] mforns:
[20:02:19] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641819
[20:03:02] mforns: how does augment_netflow know where the network_region_config file is?
[20:08:51] ottomata: looking at the change
[20:10:55] ottomata: ah! I understand, and I think it still won't work!
[20:11:12] you're right
[20:12:26] ottomata: oh, no... the file path is hardcoded, because the transform function does not take args
[20:12:33] ok, it should work indeed
[20:13:01] thanks for the fix! :]
[20:13:09] ok cool
[20:25:07] I'm looking for more information on https://wikitech.wikimedia.org/wiki/Analytics/Projects/Public_Data_Lake.
I found https://phabricator.wikimedia.org/T204950, but it's not clear what the status of this work is.
[20:25:34] I'd love to explore this idea further now
[20:27:20] balloons: that project got canned due to lack of operational support in cloud vps
[20:27:33] e.g. we'd have to re-implement all the monitoring infrastructure that we have in prod manually
[20:29:24] ottomata, what monitoring would you be looking for in particular? Monitoring for services?
[20:29:53] puppet, icinga, prometheus, grafana
[20:30:39] https://grafana.wikimedia.org/d/pMd25ruZz/presto?orgId=1
[20:30:59] https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1
[20:31:02] stuff like that ^
[20:31:10] we'd have to reimplement all the infrastructure to do that ourselves
[20:34:46] ack, thanks ottomata. The need for this has come up again due to changes being made to the wiki replicas. Assuming we can solve the monitoring issue, are you open to discussion on implementing?
[20:35:48] well, since we canned it, somehow the search team was able to make some elasticsearch replicas in prod queryable from cloud vps...i think
[20:35:55] so, there must be some way to allow that
[20:36:31] we were told that wasn't allowed via networking firewall holes, but i think they somehow just opened up their IPs to the public internet and then added some non network level access restrictions of some kind
[20:37:50] ottomata, we are actively improving the networking story.
So yes, that doesn't need to be a blocker
[20:40:52] balloons: if we can build the cluster in prod and somehow expose it (safely) for queries from Cloud VPS
[20:40:55] this would be possible
[20:41:13] although, we don't have the hardware for it right now...we might be able to reorganize a bit to free up some
[20:41:15] but doubtful
[21:00:21] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Ottomata) o/ I don't have all of the context here, but in the past, the Analytics e...
[21:02:38] the way the search replicas in prod work is that the servers live outside prod, i dunno if DMZ is the right term but they have publicly routable ip addresses. The only reason only cloud can talk to them is ferm rules on the individual boxes
[21:02:45] err, the search replicas for cloud i mean
[21:03:42] ebernhardson: outside prod?
[21:03:52] ottomata: well, by outside prod i mean the normal prod network you get an internal address
[21:03:56] these boxes have public ip's
[21:03:59] right ok
[21:04:16] but they still are in production infra, e.g. monitorable with icinga and prometheus
[21:04:19] that we use in prod
[21:04:26] ya?
[21:04:27] yes, just like other prod things with public ips
[21:04:32] right
[21:04:41] so that is an option too balloons ^
[21:04:57] that project just got canned and then deprioritized. i think we would still like to do it
[21:05:10] would need to talk about goals, priority etc. with the team
[21:05:11] awesome. We would love to help :-)
[21:05:22] if we could co-maintain that with cloud engineers too that would be amazing
[21:05:57] i'm sure milimetric has lots of thoughts :)
[21:17:14] balloons: I'm willing to work on re-prioritizing this. I saw the cloud db reorg and was sad about this task again. So I'm glad you bring it up.
I think we have some big challenges you saw above and also the data currently updates monthly and we'd like to make that daily or hourly. But even that would still be too slow for some use cases on cloud. Solutions are possible but as we approach real time they get complicated
[21:20:09] milimetric, completely understand. My goal would be to get a system in place for those use cases we aren't supporting well now. A solution, even if imperfect, is the ideal outcome to me. We can always make further improvements
[22:13:14] * razzi out for another walk
[22:23:31] Analytics, Product-Analytics: Analyze differences between checksum-based and revert-tag based reverts in mediawiki_history - https://phabricator.wikimedia.org/T266374 (nettrom_WMF) We're quickly running out of time in Q2, so moving this to Q3.
[23:07:45] Analytics, Beta-Cluster-Infrastructure, Event-Platform, Patch-For-Review: Server returned error: HTTP 500 appears while trying to open VE or reply to a comment on Beta cluster - https://phabricator.wikimedia.org/T268184 (Jdforrester-WMF)
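A short sketch related to the 15:03 remark that the MediaWiki action API returns HTTP 200 even when the request failed: a client has to look inside the JSON body rather than trust the status code. This assumes the API's default (legacy) error format, where a failure is reported as a top-level `error` object with `code` and `info` keys. The function name `action_api_error` and the returned strings are hypothetical, and this is not the fix from the linked Gerrit change.

```python
import json
from typing import Optional


def action_api_error(status_code: int, body: str) -> Optional[str]:
    """Return a short error description for an action API response, or None on success."""
    if status_code != 200:
        # Transport-level failure: the status code is meaningful here.
        return f"http-{status_code}"
    try:
        payload = json.loads(body)
    except ValueError:
        return "invalid-json"
    # Assumed shape of a failure: {"error": {"code": "...", "info": "..."}}
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict):
        return f"{error.get('code', 'unknown')}: {error.get('info', '')}"
    return None
```

A caller would treat any non-None result as a failed request, including the cases where the HTTP status was 200.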