[02:20:33] (03PS1) 10Gerrit maintenance bot: Add smn.wikipedia to pageview whitelist [analytics/refinery] - 10https://gerrit.wikimedia.org/r/632829 (https://phabricator.wikimedia.org/T264859) [06:40:00] good morning! [06:40:29] I just realized that the hdfs-balancer is on an-coord1001, I was really convinced it was on launcher [06:40:32] I am moving it now [06:46:12] !log move the hdfs balancer from an-coord1001 to an-launcher1002 [06:46:15] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [06:52:34] ok looks like it was working [07:14:43] !log decom analytics1043 from the Hadoop cluster [07:14:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [07:20:30] blocks replication started [07:51:22] Good morning! [07:57:54] o/ [07:58:03] anything I can help with elukey ? [07:58:36] joal: I have reimaged an-scheduler1001 to Stretch, so in theory we could move oozie there (I am working on the old CR that I opened a while ago) [07:59:17] * joal preps for some oozie battle [08:00:20] joal: otherwise we can do it next week, there is really no rush [08:00:58] you're the lead elukey, I'll cover your back when you need :) [08:03:14] will keep working on the patch, if I feel ok I'll ping you later :) [08:03:22] ack! [08:03:28] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&refresh=5m&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics worries me a lot [08:04:52] 10Analytics: Update Wikidata usage metric - https://phabricator.wikimedia.org/T264945 (10JAllemandou) @Nuria: I disagree - The reason for which we can't have historical data for this metric is because `wmf_raw.mediawiki_wbc_entity_usage` is not historified. We could always use the last available dump of `wmf.med... 
[08:05:30] hm [08:07:36] elukey: the big users AFAICS are presto and hive-server2 [08:08:02] presto is 4g, like oozie, not much [08:08:09] the two hives are big [08:08:29] we could think about having a different xms for those, to free some space [08:08:51] but I like more when xms==xmx so we pre-allocate all that we need, less surprises [08:09:13] if we remove oozie it will be less cpu usage, less ram usage (4G) [08:09:23] elukey: should we use an-scheduler for hive? [08:09:58] joal: this is what I asked yesterday, but then we'd not have a clear test environment for airflow/whatever/etc.. [08:10:14] right [08:10:34] I recall that Marcel was saying that some CPU was needed, also if we parallelize more things we'd need to test on a host with cores et.c. [08:10:52] what do you think? [08:10:55] I am open to any suggestion [08:11:45] For the moment we still are at decision level based on features, not integration test - We probably should move to secure hive in term of memory, and when we need to integration-test a scheduler we'll ask back (could also be a ganeti VM?) [08:12:36] elukey: side-note - I have seen a post from Jarek today advertizing that airflow 2.0 was in the pipes - We should have a look at what comes next in terms of features [08:14:06] joal: sure but whatever we want to test will run a host, and a vm is very limited in cores/memory for our use cases.. without oozie and hive an-coord1001 should be fine [08:14:26] the next step after this would be, if people agree, to use one of the presto workers as coordiantor [08:14:30] *coordinator [08:14:54] so we can selectively move the coordinator across nodes, rather than having it on coord [08:15:03] and that will be less network/ram/cpu usage etc.. 
[08:15:10] so down to -8G [08:15:28] plus an-coord1001 will get to 64G of ram soonish, hopefully [08:15:43] this is my medium/long term idea [08:15:46] an-coord with 64G ram is probably safe for what it does [08:16:17] Which means that an-scheduler can be used for something else I guess (scheduling or other) [08:16:20] it is also a matter of resiliency, if it goes down now it brings too many things with it [08:16:29] yes this is my idea [08:16:36] I wonder: Is an-scheduler similar to an-coord in term of hardware-conf? [08:16:51] it was a notebook, 32 vcores and 64G or ram [08:16:59] but not a lot of disk space [08:17:11] (~100G, that is ok for scheduling) [08:18:18] the alternative is renaming an-scheduler1001 to an-query1001 [08:18:28] and move presto coord + hive to it [08:18:47] (an-query1001 is Andrew's approved so we are fine :D) [08:19:16] elukey: so to summarize- We need 2 coords with 64Gb RAM for redundancy (we can use active-active with manually splitting services, but we need two hosts so that if one goes down we still have another one) - Currently we have only one, but an-scheduler is not used (yet) - We have presto hosts that we could use for something else than just presto (very beefy) [08:19:48] Wow - Nice elukey (the last post with an-query) [08:20:07] it would require some work but I could do it [08:20:20] We rename an-scheduler to an-query, move presto and hive there, and then rename (at some point) an-coord to an-scheduler [08:21:01] This strategy --^ leaves us with splitted services, but no backups if one of those fail [08:21:05] coord still holds the meta db, so calling it scheduler would be misleading in my opinion [08:21:27] why no backup? 
[08:21:38] hm - I thought we had a dedicated mysql for our meta+oozie dbs [08:21:54] * joal has memory issues - too many writes to sdd [08:22:29] having two similar nodes with 64G of ram makes us be able to move services around if needed [08:22:40] an-coord1001 will get some soon [08:22:57] and the database can failover now to db1108 [08:24:26] Then a question - Should we use db1108 as master for meta+oozie, and have separated services as defined above? (for understanding facilitation) [08:25:11] db1108 should remaing a replica, we have also bacula backups running on it [08:25:20] Another way is to rename an-scheduler to an-coord1002, move stuff to it as needed and have all our base services split between an-coords [08:25:22] (and it replicates also piwik etc..) [08:25:31] can we batcave for a minute? [08:25:37] yes I was thinking the same, an-coord1002 [08:25:38] it'll make discussing options easier [08:25:43] but Andrew will hate it for sure [08:25:47] :D [08:25:48] :D [08:26:04] yep lemme grab my laptop so I can make coffee [08:35:13] joal: sorry still working on it [08:35:19] np elukey - [08:35:22] no rush :) [08:35:36] hardware problems are always the most troublesome [08:38:33] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging - Applied at most in next deployment train" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/632829 (https://phabricator.wikimedia.org/T264859) (owner: 10Gerrit maintenance bot) [08:39:23] joal: better if we do it in here sorry [08:39:27] np elukey [08:40:34] So, my idea is that if we go for 2 differently named hosts (an-scheduler and an-query for instance), it feels bizzare to have our single holding both data on one of them [08:40:58] similarly, if one of them fails, we'll probably use the other handle the services while repairing [08:41:31] This makes me feel that those two hosts are actually the same family, used as active-active with manual splitting as I was saying [08:41:49] Happy to do otherwise, just expressing feelings on names 
here elukey :) [08:43:30] it makes sense to have to an-coord100x for me, the only issue in having active/active is that you split services but you risk, after a bit of time, to forgetting to test if all services can run ok on one node [08:43:49] so when you need it the most (one of the host down), troubles [08:44:06] we could solve this easily with periodical drill days [08:44:20] *fire drill days [08:45:36] elukey: yes - another way could be, as you were saying, to reuse a computation-worker as a coord host in case of fire [08:45:51] which feels feasible given the number of hosts we have [08:46:39] nono what I meant was to move only the presto coord to a worker [08:46:47] permanently I mean [08:46:51] Ah [08:47:08] and then move it around if needed [08:47:16] so separating presto from an-coord [08:47:25] If we go for 2 coord hosts, I don't see it necessary [08:47:25] Morning! [08:47:31] Hi klausman :) [08:47:37] !log Starting re-image of stat1006 to Buster [08:47:39] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:48:10] joal: it separate concerns, there is no real need for presto on an-coord in my opinion [08:48:22] a worker can act as coordinator [08:49:05] works for me elukey - my reason for keeping it was to have a coord that would be a query-coord (if I may say) - But I don't really mind [08:49:49] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin1001.eqiad.wmnet for hosts: ` ['stat1006.eqiad.wmnet'] ` The log can be found... [08:49:54] elukey: to put it in other words - we have hive backend stuff in coord, so why not presto? 
[08:51:41] joal: my point is that less things a host/vm does the better [09:00:40] joal: anyway, I start to like the an-coord1002 idea, it gives us a way to have a quick way to failover if needed [09:01:18] we keep services split for convenience, but if needed they can go on one node [09:01:40] in case of nuclear disaster (an-coord1001 hw down for days) [09:01:43] we would: [09:02:05] works for me elukey - And possibly the need to be sure that it all fits on a single node makes us improve our fire-testing procedures :) [09:02:08] 1) move away services to coord1002 (probably only oozie will remain there, and maybe airflow?) [09:02:18] 2) failover manually to db1108 for the db [09:02:30] total time, probably one hour ma [09:02:32] *max [09:02:47] the problem will be to convince Andrew :D [09:04:11] elukey: maybe we can also think of having a third coord host acting as another backup, allowing for even more service splitting? [09:04:39] joal: nono too many nodes, two are sufficient [09:04:55] Or we can officialy say that one of workers is known to be dedicated to fire-fighting help in case computing power is needed [09:05:00] elukey: --^ [09:05:51] we can always think about finding space on other hosts, but I'd prefer not (of course it may be needed, in that case we'll have to do it) [09:06:20] eventually I'd like to move the database away to a dedicated db node [09:06:29] so the coords will become stateless [09:06:42] ack elukey - thanks for brainstorming with me on that, it helps me understand service-placement :) [09:06:56] thanks as well, I think this could be a good solution [09:06:59] elukey: I very much like that idea (stateless coords0 [09:07:40] joal: we'll reach that state slowly, we never really concentrated our efforts on removing SPOFs [09:08:10] yep [09:10:17] elukey: have you seen my patch on dropping webrequest-sequence-stats? [09:14:40] nope, checking [09:15:46] (03CR) 10Joal: "Detail optimization - otherwise good for me!" 
(031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/632597 (owner: 10Fdans) [09:16:07] joal: looks fine, do you want me to merge? [09:16:21] elukey: let's do that :) [09:17:02] elukey: once applied, can we please run it manually once? [09:19:18] joal: yep already done [09:19:22] it is running [09:19:23] \o/ [09:24:50] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10Kormat) [09:28:24] 10Analytics: Check home/HDFS leftovers of leila - https://phabricator.wikimedia.org/T264994 (10elukey) Should we move the content of the `leila` to `leizi` on stat100x and hdfs? [09:33:05] I am doing a roll restart of the druid overlords on druid analytics to enable TLS for mysql conns [09:33:13] ack elukey [09:33:20] it should now work with the puppet ca being in the default truststore [09:33:33] if those are ok, I'll move to coordinators [09:37:34] of course it doesn't work [09:37:44] I hate java [09:50:28] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1006.eqiad.wmnet'] ` Of which those **FAILED**: ` ['stat1006.eqiad.wmnet'] ` [09:57:12] stat1006 reimaged, no unowned files in /srv, doing one last reboot [09:58:26] 10Analytics-Clusters, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10elukey) Today I had a chat with Joseph and some ideas came up. Rather than creating specialized hosts (like an-query100x, an-scheduler100x, etc..) we could simply rename an-sc... [10:00:16] joal: added thoughts in --^ [10:00:22] klausman: nice! [10:07:23] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (10klausman) Reimage of 1006 and 1007 were successful. 
[10:12:19] <_joe_> hellooo analytics friends, I have a question [10:13:39] <_joe_> say I want to be able to emit an event on kafka, I want to process this event running a python script, with a lot of dependencies, and it needs to run in production [10:13:48] <_joe_> do we already have a way to do that? [10:14:38] <_joe_> my natural idea would be to set up something like kubeless (https://kubeless.io/) and run said script within a container [10:15:08] <_joe_> I was wondering, though, if we have other ways to do such a thing [10:15:48] <_joe_> I've never looked into the stream processing system you're building, I'm not sure if it could support such a model [10:15:55] As far as I know we don't have any suggested/preferred way to do this, usually people create pip envs on stat100x to do similar things (I think) [10:16:43] Andrew is out this week but later on more people from the team should come online, I'll make sure they read and answer [10:17:05] <_joe_> elukey: I was wondering if we could use flink, since you're setting it up [10:18:27] _joe_ not sure if flink fits the use case, but I am a little ignorant about it.. The discovery team is working on having flink in production, for IIUC for the moment they only run it on hadoop for testing. No work has been done to make flink available on k8s yet [10:18:44] it is in Andrew's backlog [10:19:01] elukey: any idea what was going on with Nurias jup notebooks? She mentioned some breakage in a mail to an-internal [10:19:05] <_joe_> well, on k8s I'd expect a lambda service to be set up, hence kubeless [10:19:18] yep that would make total sense [10:19:41] <_joe_> anyways, thanks, I'll wait for further inputs [10:19:47] ack! [10:20:23] klausman: I think the python venv was messed up, there is a base one that was re-created for buster that the jupyterhub service handles [10:20:31] Right. [10:20:44] So do we need to do this to 1006 as well? 
[10:20:49] every time somebody logs in to it, a temporary systemd service unit is spawned cloing that venv [10:22:39] the part that I didn't get from the email (I wanted to follow up later on) is what is missing, since nuria seems to have followed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Resetting_user_virtualenvs as we advertised to users [10:23:10] some logs are on journalctl -u jupyter-nuria-singleuser.service [10:23:41] Will have a look-see [10:25:14] there are also logs in journalctl -u jupyterhub [10:25:30] but so far what I see seems to be that the venv was messed up [10:26:13] How would we check for the same damage on 1006? [10:27:27] I checked manually on 1007 for jupyter and it was running fine for me, but this is what I usually do [10:27:47] ssh -L 8000:localhost:8000 stat100X.eqiad.wmnet [10:27:58] localhost:8000 and login with shell+pass account [10:28:32] at this point one is logged to the jupyterhub service, not yet to the singleuser services [10:28:52] they are created if a kernel is launched, that can be done like htis [10:28:54] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) Done! And confirmed with kafkacat, eg: `"comms": "2914:420_2914:1008_2914:2000_2914:3000_14907:4"` As well as no dr... [10:29:06] - open a terminal and kinit first (then it can be closed) [10:29:24] - drop down menu on the right in the main page, and select something like pyspark etc.. [10:29:34] at that point, the singleuser service is started [10:29:45] 10Analytics: Categorylinks dump might have some problem with the encoding - https://phabricator.wikimedia.org/T264850 (10marcmiquel) I've noticed that other languages like Russian or Macedonian have the same problem. [10:29:46] if there is any issue with the venv, it will fail to start etc.. [10:32:32] Roger. 
Will take a closer look after lunch [10:32:39] will do too, lunch! [10:34:59] !log force the re-creation of default jupyterhub venvs on stat1006 after reimage [10:35:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:35:21] what I did was: [10:36:00] sudo systemctl stop jupyterhub [10:36:00] cd /srv/jupyterhub [10:36:00] sudo rm -rf jupyterhub-venv venv [10:36:00] cd deploy/ [10:36:00] sudo ./create_virtualenv.sh [10:36:03] sudo ./create_virtualenv.sh ../venv [10:36:05] sudo systemctl restart jupyterhub [10:37:01] the create_virtualenv.sh by default creates jupyterhub-venv, but I discovered that the jupyterhub service unit uses "venv" [10:37:04] elukey@stat1006:/srv/jupyterhub/deploy$ sudo systemctl cat jupyterhub | grep ExecStart [10:37:07] ExecStart=/srv/jupyterhub/venv/bin/jupyterhub --config=/etc/jupyterhub/jupyterhub_config.py --no-ssl [10:37:37] some follow up is needed, but in theory now the base venv should use the buster wheels [10:37:48] * elukey lunch [10:43:33] hi klausman, I can confirm that jupyter notebook are not working on stat1006 [10:52:25] dsaez: can you tell me a bit more? [10:53:10] ah yes I see from the logs [10:53:21] dsaez: can you try https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Resetting_user_virtualenvs ? [10:53:39] elukey (I didn't want to interrupt your pranzo), but the server is saying: Spawn failed: Server at http://127.0.0.1:58439/user/dsaez/ didn't respond in 30 seconds [10:53:44] Okk [10:54:41] nono it is fine, I checked another thing, will resume in a bit :) [10:55:16] elukey, this will require to install all the packages again, true? the python packages [10:56:38] elukey, yep, that solved the problem. Thanks! [10:59:01] dsaez: yes correct! [11:34:51] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Amire80) >>! 
In T207171#6527029, @Nuria wrote: >>Top 100 in each language. > > To be clear, this is mostly not possible... [12:10:09] helloooo team! [12:13:06] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) Awesome! The size of the events has increased in about 25-30%, which is considerable, but I believe sustainable for now. When we sanitize... [12:17:10] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10ayounsi) Wow, that 's more then expected indeed! If it's an issue down the road we could think of filtering out some communities (for example only... [12:43:29] finallyyyyy TLS working between druid and mariadb [12:43:37] going to roll it out everywhere [12:44:30] btw, rocm upstream has fixed the missing mivisionx package [12:45:14] \o/ [12:45:54] klausman: usually people dodge my github issues, you got a fix in one day, there is a different force I believe [12:46:44] Or you're just unlucky :) [12:48:59] nono it is consistent :D [12:50:19] elukey: hello, we're talking kafka with mforns and your opinion would be welcome :) [12:51:56] !log roll restart of druid overlords and coordinators on druid analytics to pick up new TLS settings [12:51:57] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:51:59] joal: sure, bc? [12:52:22] OMW! [12:52:47] mforns: --^? [13:13:36] !log roll restart of druid overlords and coordinators on druid public to pick up new TLS settings [13:13:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:21:45] (03CR) 10Joal: [C: 04-1] "Comments about correctness and completeness. Comments about the approach problems in the associated task." 
(033 comments) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: 10Conniecc1) [13:24:25] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10JAllemandou) Starting to work on this. I have exchanged with @Milimetric about the `platform` field making the number of editors non-additive. @c... [13:28:16] elukey: looks like the data-purge job webrequest_sequence_stats is stuck :( [13:28:29] elukey: systemctl journald to see the logs? [13:30:06] joal: journalctl -u nameoftheunit [13:30:14] sudo of course [13:30:18] lemme check as well [13:31:30] ah weird journald logs rotated [13:32:20] I need to check how to increase retention [13:33:51] 10Analytics: Fix the remaining bugs open on for Hue next - https://phabricator.wikimedia.org/T264896 (10mforns) Luca asked me to give some feedback about hue-next, here are some thoughts. - Overall hue-next looks OK to me, seems I can do all I need from it, except maybe: - I usually like to open workflow instanc... [13:34:49] be back in a bit [13:36:28] elukey: I just realized that the mvisionx change will likely fail, since I missed updating aptrepo/files/updates [13:36:41] So I made https://gerrit.wikimedia.org/r/c/operations/puppet/+/632248 [13:37:00] Er, sorry, wrong URL [13:37:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/632915 This is the one [13:38:04] 10Analytics, 10Analytics-Kanban, 10Operations, 10netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10mforns) After discussing with the team, we think it's fine for now. If we want to add more fields or increase the sampling ratio, then we should i... 
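[Editor's note] The journald logs that had rotated away by 13:31 ("I need to check how to increase retention") can be kept longer with a drop-in config. A minimal sketch, assuming stock systemd-journald; the 2G cap and 30-day retention are illustrative values, not what the team chose:

```shell
# Sketch: raise journald retention via a drop-in (values are illustrative).
sudo mkdir -p /etc/systemd/journald.conf.d
cat <<'EOF' | sudo tee /etc/systemd/journald.conf.d/retention.conf
[Journal]
Storage=persistent       # keep logs across reboots under /var/log/journal
SystemMaxUse=2G          # cap total disk usage for the journal
MaxRetentionSec=30day    # drop entries older than 30 days
EOF
sudo systemctl restart systemd-journald
# Inspect current usage afterwards:
journalctl --disk-usage
```

This is a config change on a production host, so it would normally go through puppet rather than be applied by hand.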
[13:46:01] sorry elukey got distracted by Naé :) Also, will leave soon to get Lino - Let's try to troubleshoot after standup/meetings [13:53:42] klausman: good point, +1ed [13:54:33] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10Patch-For-Review: Add dimensions to editors_daily dataset - https://phabricator.wikimedia.org/T256050 (10Milimetric) I think Connie's plan was to create dashboards for folks on top of superset, taking care to present the metrics properly. I also unde... [13:54:50] a-team: the train is empty, I'm just cruising around [14:03:30] elukey / mforns what do you think, should we try and fix those two hue issues with a more complete PR and at the same time learn it a little bit so we can submit other fixes, like for Marcel's list? [14:03:47] it would take a bit, but it seems like if we don't Luca's blocked [14:04:30] milimetric: if we are willing to use hue next and fix bugs while we go (namely no big annoying issue etc..) I think it would be best, keeping the two uis atm is not ideal [14:05:17] I mean, sure, doesn't bother me. It adds a bunch of functionality and worst case I can use the command line to check on jobs [14:05:46] yeah but that would defeat the purpose of upgrading, I'd not be happy about people not using the ui anymore :( [14:11:17] wouldn't maintaining a Stretch box with python 2 worse than using the command line once in a blue moon when something in the new UI doesn't work? [14:11:25] *be worse [14:11:56] elukey, I think I broke something in Puppet :-S [14:12:24] https://phabricator.wikimedia.org/P12952 [14:17:07] checking! [14:19:05] when I look at referer class on turnilo, what does "unknown" mean? in what situation is it none of the other options? 
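[Editor's note] The jupyterhub base-venv reset that elukey ran by hand at 10:36 (and which fixed dsaez's spawn failure at 10:56) can be collected into one sequence. A sketch only — it assumes the same /srv/jupyterhub layout and the create_virtualenv.sh helper quoted in the log, and folds in the 15:43 follow-up (the deploy checkout is not updated automatically):

```shell
#!/bin/bash
# Sketch of the base-venv reset sequence from the 10:36 log entries.
set -euo pipefail
cd /srv/jupyterhub
sudo systemctl stop jupyterhub
sudo rm -rf jupyterhub-venv venv
cd deploy
sudo git pull                        # 15:43: deploy repo must be pulled by hand
sudo ./create_virtualenv.sh          # default target is jupyterhub-venv ...
sudo ./create_virtualenv.sh ../venv  # ... but the unit starts /srv/jupyterhub/venv
sudo systemctl restart jupyterhub
```

Per-user breakage is separate: users still need the "Resetting user virtualenvs" steps on wikitech, since each login spawns a singleuser service from a clone of the base venv.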
[14:20:26] klausman: very weird [14:21:02] klausman: the only think that comes to mind is that the list of packages, for some reason, is not ok [14:23:44] But that makes very little sense [14:24:33] Currently running puppet with --debug [14:24:35] yes but I am not sure if the << operator is available for our puppet version, did you check? [14:24:45] Good point [14:25:00] Let's see what I can see in debug mode, that should give clues [14:25:08] yeah it seems so though https://puppet.com/docs/puppet/5.5/lang_expressions.html [14:28:01] I have no idea where that package name (amd-rocm33) comes from. And the --debug log, while siz miles long, contains no clue where it gets that [14:28:29] klausman: I think I may have a theory, but without solid grounds - if you check apt::package_from_component's default, there is [14:28:35] define apt::package_from_component( String $component, Array[String] $packages = [$name], [14:28:47] now $name is, in our case, amd-rocm33 [14:29:28] it seems as if $packages was in some weird form, forcing the define to use its defaults [14:29:31] that don't make sense [14:29:56] Something must have changed besides my two commits. [14:30:24] I think it is a puppet weirdness, I wouldn't be surprised if you were the first one testing << [14:30:38] I can try with + ? [14:30:48] Is there a quick way to test it? [14:31:34] But yeah, go ahead. [14:32:37] ahhhhhhhhh I gooootttt ittttt [14:33:07] I completely missed it while reviewing klausman [14:33:26] $packages is only defined inside the if, so in all other cases it is undef [14:33:52] we should add 'else { $packages = $base_packages }' [14:34:01] or something like [14:34:13] Oh ffs [14:34:14] $packages = $version in $add_firmware_versions ? [14:34:33] { true => etc.., false => $base_packages } [14:34:40] that is probably more elegant [14:34:55] Sure [14:35:05] do you want to send the cr? [14:35:11] can do [14:40:57] elukey: https://phabricator.wikimedia.org/P12953 Like so [14:40:59] ? 
[14:41:34] Or with using base_packages still? [14:47:02] Made https://gerrit.wikimedia.org/r/c/operations/puppet/+/632943 for your reviewing pleasure [14:47:25] ...and it failed. [14:47:32] a couple of things [14:47:53] - line 79 is missing a ? before { [14:48:07] => needs to be aligned [14:48:22] and I think we can keep << (at this point I am curious about it :D) [14:48:29] then we can run the pcc and see how it goes [14:49:04] Updated. [14:49:07] if you have docker running on your host there is ./utils/run_ci_locally.sh in the puppet repo to run the tests [14:49:45] klausman: the => are not aligned, I think that the linter will complain :) [14:49:52] It did not :) [14:50:13] fixed anyway [14:50:17] thanks! weird [14:50:25] it complains for all other misalignments [14:52:04] looks good now! https://puppet-compiler.wmflabs.org/compiler1003/25784/stat1005.eqiad.wmnet/index.html [14:53:34] hooray [14:53:52] I am truly falling in hate with Puppet :) [14:54:00] that's the spirit! [14:55:43] I think I’ll be a couple minutes late to standup [15:04:08] oh, no standup :) [15:08:57] milimetric: standup is actually after all-tech [15:13:43] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Nuria) >Is it possible for Nigeria, Mali, Kenya, India, Philippines, Romania, Kyrgyzstan? It depends but (other than Rom... [15:16:20] is anyone else not managing to mvn package refinery-core on master because of pageview test failures? [15:17:19] mforns: it makes some time I've not tested that [15:19:11] mforns: using master compilation works for me [15:19:25] oh, ok! thanks [15:22:20] mforns: on your desktop or stat machine? [15:22:50] nuria: on stat1007 it failed. on stat1005 it seems to be working fine! 
[15:23:08] in both cases I removed and recloned the repo [15:23:33] mforns: maybe the buster migration has something to do with that [15:23:44] mforns: did you cleaned up your .mvn dir? [15:23:53] nuria: no, will try [15:26:33] mforns: do ou use 'clean' before 'package'? [15:26:45] joal: I did separately [15:26:47] I confirm compilation of master has worked for mo on stat1007 [15:26:57] :C [15:27:15] mforns: mvn -pl refinery-core -am clean package [15:27:27] joal: yes that's what I did [15:27:33] meh [15:28:27] nooo... stat1005 is failing for me as well.. [15:28:42] wow this is weird [15:28:48] joal: did you pull the latest code? [15:28:56] mforns: yes, latests master [15:29:03] :O [15:29:42] mforns: it might be that your dependencies are poluted - You can try to drop ~/.m2/repository [15:29:43] by the errors seems related to character encoding [15:29:49] ok [15:30:05] Then you'll have to download the world when compiling, but it;s worth a try [15:30:30] mforns: If it's related to encoding it's too much of linux-magics for me and I'll not be able to help :) [15:30:48] ok, trying [15:33:01] joal: no, same :CCCCCCCCCC [15:34:03] MEH!! [15:35:32] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) In terms of the privacy considerations for countries with low pageview counts, I found that the most-viewed ar... [15:35:40] mforns: would you mind trying that? https://stackoverflow.com/questions/17656475/maven-source-encoding-in-utf-8-not-working [15:41:20] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Nuria) @lexnasser Nice, yes, same considerations apply to your example. That such a low count is available speaks of a bu... 
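[Editor's note] The test failures turn out to be encoding-related — later in the log (15:55) adding `-Dfile.encoding=UTF-8` to the surefire plugin config fixes them. For a one-off run without touching the pom, the same property can be passed to surefire's forked test JVM from the command line via its `argLine` user property. A sketch, assuming no `argLine` is already configured in the pom (`-DargLine` on the command line would override a pom-configured value):

```shell
# Sketch: force UTF-8 in surefire's forked test JVM for a single build.
mvn -pl refinery-core -am clean package -DargLine="-Dfile.encoding=UTF-8"
```

The pom-level fix the team went with (T265058 / Gerrit 632962) is the durable one; this is only a workaround for local runs.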
[15:43:57] !log executed git pull on /srv/jupyterhub/deploy and run again create_virtualenv.sh on stat1006 (pyspark kernels not running due to a missing feature) [15:43:59] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:45:02] !log executed git pull on /srv/jupyterhub/deploy and run again create_virtualenv.sh on stat1007 (pyspark kernels may not run correctly due to a missing feature) [15:45:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [15:50:39] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10JAllemandou) > speaks of a bug I disagree :) The raw data is available for per-language project broadly on the API or on... [15:50:41] 10Analytics: Update Wikidata usage metric - https://phabricator.wikimedia.org/T264945 (10Nuria) Do not disagree, just mentioning this as something to think about. >(there must be a way to rebuild wikidata-item usage in a page when parsing the revision content To my knowledge there is not as there is no "unbund... [15:53:31] so the logs above about stat1006/7 were related to [15:53:31] https://gerrit.wikimedia.org/r/c/analytics/jupyterhub/deploy/+/612484/1/kernels_buster/spark_yarn_pyspark/kernel.json [15:53:56] makes sense --^ elukey !!!! [15:54:00] basically /srv/jupyterhub/deploy is not updated automatically, so it was missing some commits [15:54:13] Tiziano contacted me that pyspark was not working :( [15:54:18] hopefully now we are more on track [15:55:16] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Nuria) >The raw data is available for per-language project broadly on the API or on dumps. You are right. i forgot these... [15:55:21] all stats are up to date, just checked [15:55:46] joal! 
adding -Dfile.encoding=UTF-8 to the surefire plugin config worked :D [15:55:54] Maaaaaan [15:56:02] I have no idea why though :S [15:56:03] thanks! :] [15:56:20] mforns: I assume we want a patch for that line, just in case :) [15:56:31] joal: yes, was going to say that [15:56:34] will do [15:56:38] <3 mforns [16:02:10] mforns: standup? [16:02:12] ping mforns [16:02:17] oops [16:02:21] is lexnasser coming to stand up today? [16:02:23] 10Analytics-Clusters, 10Analytics-Kanban: PySpark Error in JupyterHub: Python in worker has different version - https://phabricator.wikimedia.org/T256997 (10elukey) 05Open→03Resolved This should be fixed everywhere, closing! [16:21:38] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (033 comments) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [16:23:55] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [16:47:47] 10Analytics-Radar, 10Anti-Harassment, 10CheckUser, 10Privacy Engineering, and 2 others: Deal with Google Chrome User-Agent deprecation - https://phabricator.wikimedia.org/T242825 (10fdans) [16:49:54] 10Analytics, 10Anti-Harassment, 10CheckUser, 10Privacy Engineering, and 2 others: Update uaparser if needed to handle Google's privacy changes - https://phabricator.wikimedia.org/T265057 (10fdans) [16:51:26] 10Analytics-Kanban, 10Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (10Milimetric) disregard (testing whether projects are added when you mention them) [16:51:38] 10Analytics-Kanban, 10Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (10Milimetric) disregard mentioning #analytics [16:51:58] 
10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (10Milimetric) [16:53:01] 10Analytics: [refinery-source] Add encoding config to surefire plugin to avoid building issues - https://phabricator.wikimedia.org/T265058 (10mforns) [16:53:38] (03PS1) 10Mforns: Add encoding config to surefire plugin conf [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/632962 (https://phabricator.wikimedia.org/T265058) [16:54:02] 10Analytics: Increase in usage of /var/lib/mysql on an-coord1001 after Sept 21st - https://phabricator.wikimedia.org/T264081 (10fdans) Since for the remaining data we won't be using a loop (hourly job), this won't be a concern. The remaining backfilling, which I'm testing right now, will be happening over the ne... [16:58:09] 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/uploads - https://phabricator.wikimedia.org/T263377 (10fdans) Right now we have this: https://stats.wikimedia.org/#/commons.wikimedia.org/contributing/top-editors/normal|table|last-mo... 
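[editor's note] The `-Dfile.encoding=UTF-8` fix mforns mentions above (patched as "Add encoding config to surefire plugin conf", change 632962) is the standard Maven way to pin the test JVM's default charset, so unit tests stop depending on the build host's locale. A minimal sketch of what such a pom.xml change could look like — this is an assumption about the shape of the configuration, not the actual refinery-source diff:

```xml
<!-- Hypothetical sketch, not the actual Gerrit 632962 patch.
     Forcing file.encoding in the surefire JVM makes test results
     reproducible regardless of the build machine's locale. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <argLine>-Dfile.encoding=UTF-8</argLine>
  </configuration>
</plugin>
```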
[16:58:24] 10Analytics, 10Analytics-Wikistats, 10Commons: Creating tools for compiling list of Wikimedia Commons users by contributions/uploads - https://phabricator.wikimedia.org/T263377 (10fdans) p:05Triage→03Medium [16:58:30] nuria: Sorry for the confusion, wasn't free today at 9, I'm planning on attending Mondays and Thursdays in future weeks, lmk if that's fine [16:59:27] 10Analytics, 10Event-Platform: jsonschema-tools should fail if new required field is added - https://phabricator.wikimedia.org/T263457 (10fdans) p:05Triage→03High [17:03:29] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Improve discovery of paths to delete in refinery-drop-older-than - https://phabricator.wikimedia.org/T263495 (10fdans) 05Open→03Resolved [17:10:39] 10Analytics, 10Analytics-EventLogging, 10Product-Analytics, 10Documentation: Document how ad blockers / tracking blockers interact with EventLogging - https://phabricator.wikimedia.org/T263503 (10fdans) p:05Triage→03High [17:19:59] !log removed /var/lib/puppet/clientbucket/6/f/a/c/d/9/8/d/6facd98d16886787ab9656eef07d631e/content on an-launcher1002 (29G, last modified Aug 4th) [17:20:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:22:13] lexnasser: o/ all ok with your credentials? [17:22:25] 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10fdans) p:05Triage→03Medium [17:23:46] 10Analytics, 10Event-Platform: Q2 goal. 
Deploy the canary event monitoring for some event streams - https://phabricator.wikimedia.org/T263696 (10fdans) p:05Triage→03High [17:25:21] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Add more popular articles per country data to AQS - https://phabricator.wikimedia.org/T263697 (10fdans) p:05Triage→03High [17:25:45] 10Analytics, 10Analytics-Kanban, 10User-jijiki: Mechanism to flag webrequests as "debug" - https://phabricator.wikimedia.org/T263683 (10Milimetric) @jijiki we talked this over and here are our thoughts: * let's use debug=1 in the header, that way it's more generic, in case other data pipelines need to ignor... [17:26:19] elukey: Gerrit and phabricator work fine for me, but can't login to Turnilo (Service access denied due to missing privileges.) and my keys don't work for SSHing into Stat1007 [17:27:44] elukey: nuria sent this yesterday: "do we need to re-open https://phabricator.wikimedia.org/T235688 for lexnasser to get access?" and "i also asked moritzm via e-mail cause i am not sure what is the process when you "come back" it might be we need new keys" [17:28:57] Ok I'm wrecked - will stop for tonight [17:29:34] lexnasser: I thought that Daniel already restored your creds, lemme recheck [17:29:56] aahah lol one year ago [17:30:14] I've read "October" this morning and my brain bypassed it [17:30:26] okok then I'll take care of restoring in a bit [17:30:27] sorry [17:31:06] lol, all good, I'm just working on a design doc now, so it's not blocking at the moment [17:31:54] ack [17:41:04] (03PS3) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [17:41:31] (03CR) 10jerkins-bot: [V: 04-1] multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 
10Bstorm) [17:42:04] !log restart oozie server on an-coord1001 for T262660 [17:42:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:42:07] T262660: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 [17:43:07] (03PS4) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [17:43:48] (03CR) 10jerkins-bot: [V: 04-1] multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [17:47:27] joal: to test newly-deployed oozie permissions; can you check if you can kill/restart 1 job owned by analytics in hue to ensure permissions are working? [17:48:49] joal: didn't see your message; don't worry about this, elukey and I can test it ourselves [18:08:30] !log restart oozie server on an-coord1001 for reverting T262660 [18:08:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:08:33] T262660: Review and improve Oozie authorization permissions - https://phabricator.wikimedia.org/T262660 [18:31:25] 10Analytics, 10Operations, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10elukey) [18:32:10] lexnasser: created --^, in theory somebody should look at it later on in the day, if not I'll do it tomorrow EU morning [18:32:18] is it ok or are you blocked right now? [18:32:30] (also, do you still have your old ssh key?) [18:35:02] nuria: can you add in the task the new expire date? [18:37:05] ok going to dinner, in case I'll check later (but I added two SREs to the task in CC, so we should get things unblocked soon) [18:47:16] elukey: thanks for your help! [19:25:50] elukey: will do! 
[19:35:49] 10Analytics, 10Operations, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Dzahn) Using the same key should be fine. But we will need a new "expiry_date" please. And should we use expiry_contact: nruiz@ like before? [20:26:02] 10Analytics, 10Event-Platform, 10Platform Engineering Roadmap Decision Making: Need for new event-type - `user_create` and `user_rename` - https://phabricator.wikimedia.org/T262205 (10AMooney) [20:40:30] * razzi offline for a half hour [21:27:53] (03PS5) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [21:51:22] 10Analytics, 10Operations, 10SRE-Access-Requests: Renable SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T265071 (10Nuria) Expiry contact will be @Ottomata end data is April 1 2021 [21:51:42] (03CR) 10Nuria: [C: 03+2] Add encoding config to surefire plugin conf [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/632962 (https://phabricator.wikimedia.org/T265058) (owner: 10Mforns) [21:54:31] 10Analytics-Clusters, 10Patch-For-Review: Review an-coord1001's usage and failover plans - https://phabricator.wikimedia.org/T257412 (10Nuria) +1 to the active/active plan [21:56:16] (03Merged) 10jenkins-bot: Add encoding config to surefire plugin conf [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/632962 (https://phabricator.wikimedia.org/T265058) (owner: 10Mforns) [22:03:21] (03CR) 10Nuria: [C: 04-1] Add DesktopWebUIActionsTracking fields to eventlogging allowlist (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/631988 (https://phabricator.wikimedia.org/T263143) (owner: 10MNeisler) [22:08:24] Hello A-team, [22:08:24] I'm getting a "Spawn failed" failed message when I ssh into stat6, navigate to jupyter lab and try to launch the server. 
I've updated the fingerprint. What am I missing? [22:12:45] 10Analytics-Radar, 10Product-Analytics, 10Anti-Harassment (The Letter Song): Capture special mute events in Prefupdate table [4 hour spike] - https://phabricator.wikimedia.org/T261461 (10ARamirez_WMF) [22:13:07] iflorez: after the OS upgrade your env needs to be reset: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Resetting_user_virtualenvs [22:13:29] 10Analytics-Radar, 10Product-Analytics, 10Anti-Harassment (The Letter Song): Capture special mute events in Prefupdate table [4 hour spike] - https://phabricator.wikimedia.org/T261461 (10ARamirez_WMF) [22:14:28] iflorez: [22:14:37] https://www.irccloud.com/pastebin/yIuTUzZD/ [22:16:13] thank you [22:21:18] 10Analytics, 10MediaWiki-REST-API, 10Platform Team Sprints Board (Sprint 5), 10Platform Team Workboards (Green), 10Story: System administrator reviews API usage by client - https://phabricator.wikimedia.org/T251812 (10eprodromou) [22:30:36] 10Analytics, 10Analytics-Wikistats, 10Inuka-Team, 10Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10lexnasser) Just finished a first draft of the design doc for this project! You can find it here: https://docs.google.com/...
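[editor's note] The "Spawn failed" fix elukey points iflorez to (the Wikitech "Resetting user virtualenvs" page) boils down to archiving the stale per-user virtualenv, built against the old OS's Python, and recreating it against the new one. A rough shell sketch of that pattern, using a scratch directory instead of a real `~/venv` — the path names here are made up for illustration; follow the linked Wikitech page for the actual procedure:

```shell
# Illustrative reset pattern (scratch paths, not the real ~/venv):
# a venv built against an old OS's Python is moved aside, then
# recreated fresh against the current interpreter.
VENV=./scratch-venv
mkdir -p "$VENV"                   # stand in for the stale venv
mv "$VENV" "${VENV}-old-backup"    # archive rather than delete
python3 -m venv "$VENV"            # recreate against current Python
test -x "$VENV/bin/python" && echo "venv recreated"
```

On the real hosts the recreation step is performed by JupyterHub itself when the server is respawned, which is why stopping the server first matters.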
[22:38:10] (03PS6) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [22:38:35] (03CR) 10jerkins-bot: [V: 04-1] multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:40:32] (03CR) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas (031 comment) [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:42:37] (03PS7) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) [22:43:09] (03CR) 10jerkins-bot: [V: 04-1] multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254) (owner: 10Bstorm) [22:47:18] (03PS8) 10Bstorm: multiinstance: Attempt to make quarry work with multiinstance replicas [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/632804 (https://phabricator.wikimedia.org/T264254)