[10:17:46] Analytics-Clusters, Operations, vm-requests: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 (elukey) elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_B an-test-ui1001.eqiad.wmnet --vcpus 2 --memory 4 --disk 20 --network analytics START -...
[10:30:28] klausman: morning!
[10:30:40] if you are feeling better today we could create a VM in ganeti
[10:30:45] for the test druid node
[10:39:06] Sounds good
[10:42:59] https://wikitech.wikimedia.org/wiki/Ganeti#Create_a_test_VM is the starting point
[10:43:24] luckily ipam in all the dcs (except codfw) is now managed by netbox
[10:43:40] so no DNS changes needed upfront to allocate ips etc..
[10:43:52] (until recently we did those changes manually)
[10:46:41] So the fields in the Creation ticket. What's the project name? Location is probably eq, service name "Druid"? Internal network, as for the size of the machine, I have no clue.
[10:47:45] I think we can skip the project, eqiad is right, service name I think an-test-druid1001.eqiad.wmnet
[10:48:04] network is Analytics VLAN (there is an option for it in the cookbook)
[10:48:18] for the vm size, we'd need to think about it
[10:48:42] in the old test cluster, we had an old hadoop worker as single-node test cluster
[10:48:55] but it was relatively beefy compared to what a vm normally is
[10:49:35] I think we can definitely settle for something like 4 vcores, 8G of ram, 50/100G disk
[10:49:48] that is not a small vm
[10:49:55] but not huge either :)
[10:50:19] Ok, I just looked at an-druid1001, and its druid JVM has 73G RSS O.o
[10:50:22] Druid has an on-disk cache so 100G might be good
[10:50:39] yeah but we load a ton of data to those nodes
[10:51:03] for the test cluster we just upload a little little subset of data
[10:51:09] just to prove that the whole pipeline works
[10:51:09] Ok, so 4c, 7G, 100G. Any other requirements?
[10:51:14] 8G
[10:51:19] typo :)
[10:51:22] ack :)
[10:51:30] and analytics vlan, other than that I'd say no
[10:51:38] ok
[10:55:40] klausman: let's rename the title with something more meaningful :D
[10:55:59] Sorry, trying to grok the six miles of prod server lifecycle doc
[10:57:09] :)
[10:57:32] So I have to create the vm in netbox first?
[10:57:51] nono we can just run the cookbook
[10:58:08] it will allocate the ips etc.. and also commit the dns change
[10:58:14] Ok.
[10:58:22] (the step to generate the commit takes minutes, be patient)
[10:58:39] to run the cookbooks, we can jump on cumin1001
[10:58:54] and with sudo cookbook sre.ganeti.makevm -h you can see the options
[10:59:09] the other thing to decide is what row to use, usually I pick the least crowded
[10:59:13] there are steps on the doc
[10:59:58] Hurm. there's no easy way to see which Ganeti node is on which row?
[11:00:41] https://netbox.wikimedia.org/search/?q=ganeti10&obj_type=
[11:02:00] Does this look sane:
[11:02:02] sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 --network analytics eqiad_D an-test-druid1001.eqiad.wmnet
[11:05:08] checking
[11:05:34] +1 looks good
[11:05:47] remember to run it in tmux/screen
[11:05:51] (veeery long)
[11:06:50] aaand it once more does not see my tmux
[11:07:23] I usually tmux then sudo, and it works fine
[11:08:01] https://phabricator.wikimedia.org/P13100
[11:08:05] It does not, for me.
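For reference, the pattern recommended above: start tmux first, then run the long cookbook inside it so it survives a dropped SSH session. A minimal sketch (the session name is illustrative; the command is the one +1'd at 11:05):

```sh
# On cumin1001: open a tmux session first, then sudo inside it.
tmux new -s makevm

# Inside tmux: the exact invocation reviewed above.
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 100 \
    --network analytics eqiad_D an-test-druid1001.eqiad.wmnet
```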
[11:08:54] very weird
[11:09:09] For some reason, sudo unsets the TMUX var for me:
[11:09:13] cumin1001 ~ $ sudo -i
[11:09:15] root@cumin1001:~# echo $TMUX
[11:09:19] root@cumin1001:~#
[11:10:14] And looking at cumin's sudoers/sudoers.d it's no surprise. Nothing in there preserves $TMUX
[11:10:49] I mean, I can set it manually, of course (which I'll do now), but something is broken
[11:12:46] so I can repro as well, but if I just tmux + sudo echo $TMUX it works
[11:12:57] but we can open a task to infra for this
[11:13:06] oh, you mean tmux sudo cookbook, all in one commandline?
[11:13:29] nono first tmux, then sudo cookbook etc.. inside
[11:13:33] this is what I do usually
[11:13:58] well, that doesn't work for me (and never has)
[11:14:10] no idea then
[11:14:34] I mean, if you start tmux on cumin, then sudo -i (to just a shell), is TMUX set for you?
[11:16:06] Ah! but `sudo echo $TMUX` has $TMUX evaluated in the current shell, not the sudo'd one
[11:17:06] ack
[11:17:15] let's wait a bit before attempting again
[11:17:32] Well, the VM thing is running now, I set TMUX by hand
[11:17:40] ah okok
[11:40:43] klausman: when the cookbook is done, you should see a MAC address at the end
[11:41:14] what I usually do is paste all the output in the task https://phabricator.wikimedia.org/T266648#6587763
[11:45:13] ack, will do
[11:49:39] going afk for lunch, will check later :)
[11:55:48] Aye, capitano
[12:47:34] * klausman lunch
[13:27:24] ahahha thanks for the "capitano"
[13:37:49] klausman: when you are done ping me so we can finish druid :)
[13:38:05] I'll be back in 20m or so
[13:38:46] yep even an hour, didn't mean to rush :)
[14:07:33] Analytics: Quick data exploration CLI - https://phabricator.wikimedia.org/T265765 (CDanis) Visidata looks amazing, thanks! I love this idea. The biggest use case I have in mind right now is for ad-hoc analysis of Network Error Logging data (see T257527). This is a case where we have no control over the cl...
[14:09:17] Analytics-Clusters, Analytics-Kanban, Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (elukey) Change rolled out!
[14:25:11] !log restart zookeeper on an-conf1001 for openjdk upgrades
[14:25:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:29:32] (PS2) Mforns: Add Refine transform function for Netflow data set [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332)
[14:33:03] (PS3) Mforns: Add Refine transform function for Netflow data set [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332)
[14:33:23] (CR) Mforns: Add Refine transform function for Netflow data set (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[14:40:06] Analytics-Clusters, Analytics-Kanban, Operations: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (MoritzMuehlenhoff) Open→Resolved
[14:40:16] elukey: Ready when you are
[14:47:27] Made a patch for the DHCP entry. Do we also need a partman setup or will there be a useful default?
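A quick illustration of the pitfall klausman lands on at 11:16: the variable is expanded by the calling shell before sudo ever runs. The sudoers entry at the end is hypothetical, not something cumin1001 is known to ship:

```sh
# $TMUX is expanded by the *calling* shell here, so this proves nothing
# about the environment sudo actually hands to the command:
sudo echo $TMUX

# Defer expansion to the sudo'd shell to see what really survives env_reset:
sudo sh -c 'echo $TMUX'

# If preserving TMUX across sudo were desired, a sudoers entry like this
# would do it (illustrative only):
#   Defaults env_keep += "TMUX"
```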
[14:50:35] here I am
[14:51:01] yep it also needs partman
[14:51:14] see https://gerrit.wikimedia.org/r/c/operations/puppet/+/637387
[14:52:56] ack
[14:53:18] I presume this'd work: an-test-druid*) echo partman/flat.cfg virtual.cfg ;; \
[14:53:45] yep, you can also add it as |an-test-druid* to one of the other entries
[14:53:47] as you prefer
[14:54:37] (CR) Mforns: [V: +2] "@Joal, responding to your comments:" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[14:54:46] Updated the change
[15:03:06] yep looks good
[15:03:27] one nit - I usually add an entry in site.pp as well to avoid an extra change, but not really important
[15:03:48] klausman: so now puppet needs to run on the dhcp hosts before starting the vm, as described in the doc
[15:04:57] Ack, already done that. Also running it for apt (partman)
[15:06:22] Is role(druid::analytics::worker) the right role?
[15:10:04] for the moment let's use role(insetup)
[15:11:22] roge
[15:11:24] +r
[15:13:13] Role: https://gerrit.wikimedia.org/r/c/operations/puppet/+/637511
[15:14:55] Will now start the instance and see to the install
[15:17:02] ack
[15:18:10] Analytics, Analytics-Wikistats, I18n: [[Wikimedia:Wikistats-metrics-top-mediarequests-name/jam]] translation issue - https://phabricator.wikimedia.org/T266669 (Aklapper)
[15:19:36] How long does it usually take before the console shows something useful?
[15:21:22] very soon usually
[15:21:26] like a couple of mins
[15:22:19] I checked with gnt-instance info and I see "State: configured to be up, actual state is up"
[15:22:38] so in theory the start should just work
[15:23:42] `ganeti1011 ~ $ sudo gnt-instance console an-test-druid1001.eqiad.wmnet` seems to just hang
[15:25:06] weird it should be running
[15:26:12] does the console work for you?
[15:27:12] nope
[15:28:05] ctrl+] seems also not working
[15:29:02] klausman: fixed-address an-test-druid1001.eqiad.net;
[15:29:05] spot the typo :)
[15:29:38] oops. Not enough wm
[15:42:17] A-ha, now it works
[15:45:10] gooood
[15:47:11] cdanis: o/
[15:47:25] so, you want to get the NEL data into Hive instead of logstash?
[15:47:32] if so, and if we keep client_ip there, which we can
[15:47:35] you'll get auto geocoding
[15:47:52] I think so? but I wanted to make sure it would still be vaguely realtime there
[15:48:07] it lags by a few hours
[15:48:10] ah
[15:48:12] it'd still be in kafka
[15:48:15] realtime
[15:48:15] hm :/
[15:48:51] I have to admit that I feel like I'm missing context on a lot of things: why client_ip is undesirable in logstash, and if a realtime-ish setup like we have for wmf_netflow is easily reproducible or maintainable
[15:50:20] ah
[15:50:37] is netflow realtimeish? don't remember, ah yes we are ingesting into druid for that?
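Two asides on the config work above. First, the partman fragment proposed at 14:53 slots into the shell case statement of the installer's netboot config; a minimal sketch (neighbouring entries invented for illustration):

```sh
case "$host" in
    # Entry proposed above for the new test VM:
    an-test-druid*) echo partman/flat.cfg virtual.cfg ;;
    # Alternative also suggested: append |an-test-druid* to an existing entry, e.g.
    # an-other-vm*|an-test-druid*) echo partman/flat.cfg virtual.cfg ;;
esac
```

Second, the console hang turns out to be the one-character DHCP typo spotted at 15:29 (.eqiad.net instead of .eqiad.wmnet). A sketch of what the corrected host entry would look like, with a placeholder MAC (the real one is printed at the end of the makevm cookbook run):

```
host an-test-druid1001 {
    hardware ethernet aa:00:00:00:00:00;          # placeholder MAC
    fixed-address an-test-druid1001.eqiad.wmnet;  # was: ...eqiad.net
}
```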
[15:50:47] i'm pretty sure the realtime netflow stuff isn't going to have auto geocoding
[15:50:55] as for client_ip in logstash
[15:51:00] i don't really get it either
[15:51:25] but i guess timo feels that logstash data has a simpler attack vector than analytics cluster stuff, and he doesn't think we should keep client_ip there
[15:51:34] but, i guess that can be discussed more
[15:51:49] i don't think keeping it there (for < 90 days) is really any different than what we do in Hadoop
[15:51:57] but ¯\_(ツ)_/¯
[15:52:16] yeah we're ingesting into druid for netflow
[15:52:22] yeah I don't either
[15:52:31] OTOH I'm not that happy with kibana anyway ofc 🙃
[15:53:08] ha me neither
[15:53:21] i think we could probably do the same for nel that we do for netflow...i think it's a bit manual but it could be ok
[15:53:40] but, if you want the country and ASN in there
[15:53:47] yeah we'd have to get it from the headers and set it
[15:53:51] in eventgate i guess
[15:53:52] hmmm
[15:54:11] man that schema based header population idea is a good one, it just is a bigger decision than a one off
[15:54:12] maybe..
[15:54:14] lemme think about it more and discuss
[15:54:19] maybe we can do that
[15:54:20] i mean
[15:54:33] i can always add a stupid 'if $schema == NEL' { ...}
[15:54:36] to eventgate-wikimedia
[15:54:36] haha yeah
[15:54:38] but that sucks
[15:54:41] I had considered that too
[15:54:51] we could even do the mmdb lookups there, but yes, that is not great
[15:55:03] (not that doing it in inline C in VCL in ERB is great either, but)
[15:55:06] elukey: running first puppet pass (cert is done and signed)
[15:58:56] nice!
[16:00:42] the right thing would be to do the mmdb lookups in a stream processor!
[16:02:28] fdans: yoohoo
[16:03:01] andrewbogott: uh oh adding analytics-internal to cloud-announce was a mistake!
[16:03:04] can you remove that?
[16:03:30] ottomata: how so? It hasn't been that noisy has it?
[16:03:56] + don't you have an unsubscribe link?
[16:04:52] anyway, I unsub'd you
[16:05:15] we'll have to sub individually
[16:05:34] otherwise dan will have to approve every person who emails cloud-announce in analytics-internal
[16:05:41] and anyone who replies to cloud-announce emails
[16:05:45] ah, I see, that sounds like a pain
[16:05:52] but, thank you!
[16:15:21] Analytics: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Ottomata)
[16:17:07] Analytics, Event-Platform: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (Ottomata)
[16:17:24] Analytics: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Ottomata)
[16:17:26] Analytics, Event-Platform: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (Ottomata)
[16:17:40] Analytics: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Ottomata) To call this task done, we should first complete {T266798}.
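For context on the mmdb lookups under discussion: country and ASN enrichment from an IP is a couple of calls against MaxMind's readers. A minimal sketch using the GeoIP2 Java API from Scala (database paths are assumptions; lookups throw AddressNotFoundException for IPs not in the database):

```scala
import java.io.File
import java.net.InetAddress
import com.maxmind.geoip2.DatabaseReader

// Build one reader per database; paths here are assumptions for illustration.
val countryReader = new DatabaseReader.Builder(
  new File("/usr/share/GeoIP/GeoIP2-Country.mmdb")).build()
val asnReader = new DatabaseReader.Builder(
  new File("/usr/share/GeoIP/GeoLite2-ASN.mmdb")).build()

val ip = InetAddress.getByName("203.0.113.7")  // placeholder address
val countryCode = countryReader.country(ip).getCountry.getIsoCode  // ISO country code
val asNumber = asnReader.asn(ip).getAutonomousSystemNumber         // AS number
```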
[16:17:54] Analytics, Event-Platform: Refine event pipeline at this time refines data in hourly partitions without knowing if the partition is complete - https://phabricator.wikimedia.org/T252585 (Ottomata)
[16:18:14] Analytics, Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (fdans)
[16:18:24] elukey: post-install puppet runs complete (all changes converged, no more changes happening)
[16:19:14] \o/
[16:20:06] yep I can login, very nice
[16:20:12] I have a code change ready to deploy the role
[16:20:50] Analytics, Analytics-Kanban, Patch-For-Review: Undo any temporary changes made while running in codfw - https://phabricator.wikimedia.org/T261865 (Milimetric) Open→Resolved
[16:21:13] Hi all
[16:21:16] hi dsaez
[16:21:32] Analytics-Clusters, Analytics-Kanban: Move the stat1004-6-7 hosts to Debian Buster - https://phabricator.wikimedia.org/T255028 (fdans) Open→Resolved
[16:21:34] Analytics, Analytics-Kanban: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629 (fdans)
[16:21:46] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Combine filters and splits on wikistats UI - https://phabricator.wikimedia.org/T249758 (fdans) Open→Resolved
[16:21:52] Analytics, Analytics-Kanban, Patch-For-Review: Import page_props table to Hive - https://phabricator.wikimedia.org/T258047 (fdans) Open→Resolved
[16:21:58] Analytics, Analytics-Kanban, Analytics-Wikistats: Refactor breakdowns so they allow more than one dimension to be active - https://phabricator.wikimedia.org/T255757 (fdans) Open→Resolved
[16:22:00] Analytics, Analytics-Kanban, Analytics-Wikistats, Patch-For-Review: Combine filters and splits on wikistats UI - https://phabricator.wikimedia.org/T249758 (fdans)
[16:22:17] fdans: you should get bulk-edit task permissions from Andre so you can just multi-close everything in the Done column
[16:22:39] milimetric: but we don't want to do all of them for now right?
[16:22:45] there are a bunch that haven't been discussed
[16:22:47] Quick Question: Is there a way to do cross-db queries in mysql in the current setup on the stat machines? For example I need to join results from wikidatawiki with enwiki
[16:23:03] fdans: yeah, but that's temporary, eventually we'll be caught up
[16:23:13] dsaez: no
[16:23:36] thanks milimetric. I'll go with Hive then.
[16:23:40] (that's one of the reasons we import the data into hadoop and are trying to figure out a way to make a more real-time import)
[16:24:32] (the reason is because the databases are sharded over different servers, and especially the big ones like enwiki and wikidatawiki, those are like one per box or something)
[16:24:32] got it.
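The Hive route works for dsaez's question because the sqooped MediaWiki tables live in one warehouse, partitioned by snapshot and wiki_db, so a "cross-db" join is just a self-join with different wiki_db values. A rough sketch (table, snapshot, and join key chosen for illustration):

```sql
SELECT en.page_id, en.page_title, wd.page_id AS wikidata_page_id
FROM wmf_raw.mediawiki_page en
JOIN wmf_raw.mediawiki_page wd
  ON wd.page_title = en.page_title
WHERE en.snapshot = '2020-10' AND en.wiki_db = 'enwiki'
  AND wd.snapshot = '2020-10' AND wd.wiki_db = 'wikidatawiki'
  AND en.page_namespace = 0 AND wd.page_namespace = 0;
```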
[16:32:26] Analytics-Kanban, Triagers, acl*phabricator: Add fdans to triagers for batch task editing - https://phabricator.wikimedia.org/T266801 (fdans)
[16:36:36] Analytics-Radar, Editing-team, MediaWiki-Page-editing, Platform Engineering, and 2 others: EditPage save hooks pass an entire `EditPage` object - https://phabricator.wikimedia.org/T251588 (fdans) cc @Milimetric
[16:37:18] Analytics, Analytics-Kanban: Analytics Presto improvements - https://phabricator.wikimedia.org/T266639 (fdans)
[16:45:37] Analytics-Clusters, Operations, vm-requests: Create a ganeti VM in eqiad: an-test-ui1001.eqiad.wmnet - https://phabricator.wikimedia.org/T266648 (elukey) Open→Resolved
[16:45:57] Analytics, Analytics-Kanban, Event-Platform: eventgate-analytics-external occasionally seems to fail lookups of dynamic stream config from MW EventtStreamConfig API - https://phabricator.wikimedia.org/T266573 (fdans) p:Triage→High
[16:47:22] Analytics: Decide to move or not to PrestoSQL - https://phabricator.wikimedia.org/T266640 (fdans) p:Triage→Medium
[16:47:30] Analytics: Test Alluxio as cache layer for Presto - https://phabricator.wikimedia.org/T266641 (fdans) p:Triage→Medium
[16:51:05] Analytics-Radar, Operations, ops-eqiad: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (Cmjohnson) Spoke with Dell tech, Chris Bennet today. The ball was dropped by Dell, nobody ordered the new part and our case was left open and not owned by anyone. Today a new case for the backpl...
[16:51:20] Analytics, Wiktionary: Add editors per country data for Wiktionary projects - https://phabricator.wikimedia.org/T266643 (fdans) p:Triage→Medium
[16:52:05] Analytics, Analytics-Kanban, Analytics-Wikistats, I18n: [[Wikimedia:Wikistats-metrics-top-mediarequests-name/jam]] translation issue - https://phabricator.wikimedia.org/T266669 (fdans) p:Triage→High a:fdans
[16:52:54] Analytics, Event-Platform: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (fdans) p:Triage→Medium
[17:14:55] * elukey afk for ~30 mins!
[18:20:41] (PS4) Mforns: Add Refine transform function for Netflow data set [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332)
[18:21:05] are we going to deploy today?
[18:22:31] (CR) Mforns: [V: +2] "I added some unit tests." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[18:40:00] fdans: (if you are busy don't worry) - are we deploying today?
[18:42:48] (CR) Neil P. Quinn-WMF: Oozie job for Wikipedia Preview stats (5 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[18:43:24] (CR) Neil P. Quinn-WMF: Oozie job for Wikipedia Preview stats (2 comments) [analytics/wmf-product/jobs] - https://gerrit.wikimedia.org/r/635578 (https://phabricator.wikimedia.org/T261953) (owner: Sbisson)
[19:10:46] Hello A-team,
[19:10:46] I'm unable to load Jupyter Lab on stat6 since the upgrade earlier in the month. I wonder if you have tips or recommendations? The default version is installed and I just checked with Morten and his has been working without issue. To note, I am able to use Jupyter notebooks, just unable to load labs.
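A plausible first check for the JupyterLab report above (the port and URL paths are assumptions; the wikitech page linked just below has the authoritative steps). JupyterHub on the stat hosts is only reachable through an SSH tunnel, and Lab vs. the classic notebook is just a different URL path:

```sh
# Tunnel to the stat host's JupyterHub (port assumed for illustration):
ssh -N stat1006.eqiad.wmnet -L 8880:127.0.0.1:8880

# Then in a browser:
#   classic notebook UI:  http://localhost:8880/user/<shell-username>/tree
#   JupyterLab UI:        http://localhost:8880/user/<shell-username>/lab
```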
[19:11:39] elukey: since joseph is ooo let's push to next week I think
[19:13:42] iflorez: hi, i can try and look into it in a few minutes, but first i hafta ask...does the new conda based jupyter work for you?
[19:13:46] or is this the old swap stuff?
[19:14:00] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Newpyter
[19:26:04] * elukey afk!
[19:53:35] yup, I was using the old swap setup. I will try the new conda setup. The docs say that's only on stat8 at the moment?
[20:01:32] (PS5) Mforns: Add Refine transform function for Netflow data set [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332)
[20:11:06] ottomata: does this make any sense? https://gerrit.wikimedia.org/r/c/operations/puppet/+/637559
[20:35:17] success! thank you @ottomata.
[20:35:17] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Jupyter#Newpyter on stat6 worked.
[21:09:56] great!
[21:10:10] iflorez: that'll be the new and eventually only way to do things anyway :) SWAP will be deprecated eventually one day
[21:10:19] thanks for using it!
[21:11:43] mforns: just a thought: it might be nice to be able to use such a function to add isp data to other datasets too
[21:11:46] but for now this is fine
[21:12:02] i'm thinking of cdanis' network error logging stuff
[21:12:07] but we can adapt this for that then later
[21:12:09] ottomata: how would that be then?
[21:12:18] well, you have things like ip_src hardcoded in
[21:12:33] maybe it'd be possible to make the ip column configurable
[21:12:45] ottomata: oh, you mean the transform function?
[21:12:50] yes
[21:12:52] hmm
[21:12:55] ah I see
[21:13:01] would it be better to add this extra info as a single struct or map column
[21:13:04] rather than all top level?
[21:13:14] like we do for geocoded_data
[21:13:15] ?
[21:13:31] don't know...
[21:14:08] i think it might, then we could more easily re-use it in other places, rather than adding a bunch of top level fields, right?
[21:14:09] the fields are not really related to each other
[21:14:16] oh no?
[21:14:38] reading more...thought it was mostly ASN stuff
[21:15:04] also, the transform function makes some assumptions about the format of the source dataset
[21:15:12] huh..
[21:15:12] yeah
[21:15:17] it needs 7 fields
[21:15:17] hmmm
[21:15:19] no 8
[21:15:25] yeahhh
[21:15:50] ok, i'm only going to add one naming nit then, this would be annoying to make generic
[21:15:53] i take it back
[21:16:12] ip_src, ip_dst, comms, net_src, mask_src, net_dst, mask_dst, peer_ip_src
[21:17:50] (CR) Ottomata: [C: +1] "One minor naming nit, +1 LGTM" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[21:18:19] OH YOU were asking me about the puppet patch
[21:18:25] sorry i dunno how i got to the refinery patch
[21:18:47] ya that looks fine too
[21:18:53] hehe, no it's good as well!
[21:19:05] comment makes sense, will change
[21:19:10] yea? :]
[21:19:12] k
[21:19:22] don't know how to test that, though
[21:19:55] running a PCC
[21:20:58] https://puppet-compiler.wmflabs.org/compiler1003/26219/
[21:21:06] https://puppet-compiler.wmflabs.org/compiler1003/26219/an-launcher1002.eqiad.wmnet/fulldiff.html
[21:21:19] hmm maybe we can rename that file mforns
[21:21:26] network_infra_config not very apt, no?
[21:21:27] lets see
[21:22:00] aha
[21:22:19] network_region_config
[21:22:20] ?
[21:22:21] maybe?
[21:22:48] yes, fine by me! I just copied the current naming of the source data
[21:23:06] but network_region_config sounds great
[21:25:16] the output file looks good!
[21:39:08] oh hm
[21:39:47] hm
[21:40:07] i dunno, guess if that is what they call it too
[21:40:08] we should keep it
[21:40:14] mforns: i'm fine either way then
[21:40:39] whatever you prefer, I can change tomorrow!
[21:40:58] ok mforns gonna sign off, ping me tomorrow and I can merge
[21:41:08] ok! see yaa thanks!
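For reference, a minimal Spark sketch of the struct-column idea floated at 21:13, bundling the enrichment fields into one column the way geocoded_data is handled. Field names here are invented for illustration; the real transform keeps its fields top level:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

// Sketch only: collapse enrichment fields into a single struct column,
// analogous to geocoded_data, then drop the now-redundant top-level fields.
def bundleNetworkRegionData(df: DataFrame): DataFrame =
  df.withColumn(
      "network_region_data",
      struct(col("region_src"), col("region_dst"),
             col("as_name_src"), col("as_name_dst")))
    .drop("region_src", "region_dst", "as_name_src", "as_name_dst")
```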