[01:28:43] phab down!
[06:09:27] !log decommission analytics1050 from the hadoop cluster
[06:09:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:09:37] good morning :)
[06:26:06] Good morning!
[07:03:27] !log restart oozie to pick up the analytics team's admin list
[07:03:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:05:17] 2020-10-15 07:03:20,281 INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Oozie running with authorization enabled
[07:05:20] 2020-10-15 07:03:20,282 INFO AuthorizationService:520 - SERVER[an-coord1001.eqiad.wmnet] Admin users will be checked against the 'adminusers.txt' file contents
[07:05:23] joal: --^
[07:05:41] this shouldn't change much for us, but as fyi
[08:00:24] joal: time to chat about an-coord failover plans?
[08:13:55] also we need to order the new aqs nodes, so we can review the specs together to see if we need to change something
[08:17:54] https://grafana.wikimedia.org/d/XhFPDdMGz/cluster-overview?orgId=1&var-site=eqiad&var-cluster=aqs&var-instance=All&var-datasource=thanos&from=now-7d&to=now
[08:32:04] (going afk for a bit)
[09:04:23] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:05:19] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:06:55] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:07:43] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:07:53] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:09:29] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:16:36] ah snap this is surely the data drop timer
[09:18:49] yep
[09:19:13] ok so I think it is time to raise the number of connections available on the historicals
[10:03:45] Heyooo
[10:04:08] Man, I went to bed at 2330 or so and slept a full 12h :D
[10:05:03] :)
[10:10:21] Sometimes, this happens when I'm fighting a cold, but I don't feel sick, so maybe I'm just starting to hibernate for winter :D
[10:27:39] heya elukey - sorry I missed your ping
[10:28:07] elukey: I'm guessing you're gonna go for lunch soon - shall we talk failover later in the day?
[10:29:12] joal: we can do it in a few if it is ok for you, I can take lunch later
[10:29:22] as you wish elukey
[10:30:09] joal: ok for me now then :)
[10:30:21] To the cave elukey!
[10:35:45] klausman: after lunch if you have time we can go through (over meet) the test cluster etc..
[10:36:12] I'll do the same with razzi later on, so we don't need to sync for a meeting either too early for SF or too late for you :)
[10:41:44] Sounds good
[11:32:57] (CR) Joal: [C: +1] "LGTM" [analytics/refinery] - https://gerrit.wikimedia.org/r/633800 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[11:36:13] lunch!
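As a hedged illustration of the alerts above (this is not the monitoring code itself), the per-page edits route that kept timing out can be probed by hand. The host below is the public REST front of AQS rather than the internal aqs100x service checked by Icinga, and the project, page, and date range are arbitrary example values:

```shell
# Sketch of a manual probe for the AQS "edits per-page" route named in the
# alerts above. wikimedia.org/api/rest_v1 is the public face of AQS; the
# project, page, and dates here are example values, not the check's own.
project="en.wikipedia.org"
page="Main_Page"
url="https://wikimedia.org/api/rest_v1/metrics/edits/per-page/${project}/${page}/all-editor-types/daily/20201001/20201008"
echo "$url"
# curl -s --max-time 10 "$url"   # uncomment to actually hit the endpoint
```

A timeout on this request (rather than an HTTP error) matches the "timed out before a response was received" symptom in the alert text.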
[11:47:56] (CR) Joal: [C: -1] "Answer to one comment inline - still 2 fixes to provide (patch 3 was the same as patch 2) - Adding also a request to change the commit mes" (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/607361 (https://phabricator.wikimedia.org/T256050) (owner: Conniecc1)
[12:05:01] quick lunch run, back in 30m-45m
[12:53:33] * klausman back
[13:12:52] elukey: read when you are
[13:12:57] ready*
[13:15:54] klausman: I am now, bc?
[13:16:25] be there in a sec
[13:38:22] (CR) Ottomata: [C: +2] Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[13:38:51] (CR) Ottomata: [V: +2 C: +2] Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[13:43:12] (Merged) jenkins-bot: Use camus + EventStreamConfig integration in CamusPartitionChecker [analytics/refinery/source] - https://gerrit.wikimedia.org/r/629377 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[13:43:44] (PS2) Ottomata: Add camus-wmf-0.1.0-wmf12.jar with EventStreamConfig support [analytics/refinery] - https://gerrit.wikimedia.org/r/633800 (https://phabricator.wikimedia.org/T251609)
[13:49:53] hellooooo
[13:52:17] yoohoo
[13:53:33] elukey: o/ is analytics_test_cluster::coordinator alive somewhere?
[13:53:38] trying to log into analytics10303
[13:53:44] but i'm guessing its decommed?
[13:53:48] 1030*
[13:54:29] oh wait, i bet my puppet repo is not fresh...
[13:56:02] found it nm!
[13:56:25] it is yes! but I am bootstrapping it, not really ready
[13:56:31] is camus running?
[13:56:45] not sure, I am still fixing problems on it
[13:57:00] ohhhk
[13:57:07] sorry :(
[14:01:08] a-team moving grooming calendar event to after standup, right now it's all weird
[14:01:50] joal: today was 6pm standup day right?
[14:13:48] My calendar says so
[14:51:42] !log roll restart druid-historical daemons on druid1004-1008 to pick up new conn pooling changes
[14:51:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:57:40] Hi team - usually Thursday's standup is now (5pm CEST) - I think we moved it last week because of monthly tech update and forgot to put it in place
[14:57:49] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:59:04] this is me of course
[14:59:16] yeah, just noticed elukey :)
[14:59:40] fdans: see my note on standup times - just above - I was gone for kids at the moment you pinged me
[14:59:59] ha a-team should we do standup now then?
[15:00:16] ok by me
[15:00:19] +1
[15:00:22] ack
[15:00:25] Be there in 2 minutes
[15:01:09] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:03:19] ottomata: we're doing standup
[15:03:28] nuria: ping as well in case
[15:03:31] oh
[15:11:00] klausman: we forgot to ping you explicitly I think - We're doing standup now
[15:16:02] oops
[16:30:44] Gone for dinner - back after
[17:13:52] razzi: do you want to merge your change?
[17:16:58] elukey: Yeah, I can merge them. Am I understanding it correctly that low risk infrastructure changes like the maxmind archiving one can be deployed pretty much whenever (though Fridays are generally to be avoided) whereas other changes should go with the weekly train?
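The "conn pooling changes" rolled out to druid1004-1008 above are not spelled out in the log. As an assumption-laden sketch only: raising the number of connections a Druid historical can serve usually means bumping its HTTP server thread pool in runtime.properties. The property name is a standard Druid knob, but the value below is invented, not the one actually deployed:

```properties
# Hypothetical fragment of a Druid historical runtime.properties.
# druid.server.http.numThreads caps concurrent HTTP connections the
# historical will service; the value 60 is illustrative only.
druid.server.http.numThreads=60
```

A rolling restart of the druid-historical daemons, as logged above, is what makes such a change take effect.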
[17:17:55] razzi: so our puppet changes can be decoupled from the weekly train (unless there is a dependency etc..), and if possible we should avoid doing invasive changes Friday before leaving :D
[17:22:32] in other news, we have an-coord1002!
[17:22:33] elukey@an-coord1002:~$ uptime 17:21:56 up 8 min, 1 user, load average: 1.28, 1.68, 1.09
[17:23:25] \o/ an-coord1002
[17:24:44] elukey: I'm looking at Envoy for a bit with ottomata, then my plan is to merge the maxmind puppet stuff in an hour
[17:25:30] razzi: sure, if you are working with Andrew then I'd log off, we can look at hadoop nodes/test-cluster/etc.. tomorrow
[17:25:59] elukey: sounds good, ttyt
[17:29:12] * elukey afk!
[17:57:09] !log taking yarn.wikimedia.org offline momentarily to test new tls configuration: T240439
[17:57:12] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:57:13] T240439: Move https termination from nginx to envoy (if possible) - https://phabricator.wikimedia.org/T240439
[19:00:59] fdans: question on pageviews_complete, what is the plan for monthly files, are we putting those out the same way that Erik did?
[19:03:03] milimetric: nope, that's a feature that needs to be tasked
[19:03:34] fdans: so then we should fix erik's job, because this means we won't have monthly dumps for a while
[19:03:42] (I can fix it, just thinking out loud)
[19:04:30] milimetric: are people asking about monthly data?
[19:04:47] yes, and daily data, and I'm doing a bad job of relaying it to the rest of the team but it's becoming a nuisance
[19:09:48] ottomata: I think I need someone with permissions to help me debug this job
[19:10:17] sure
[19:10:19] it's probably some cron or something that's broken, it was producing https://dumps.wikimedia.org/other/pagecounts-ez/
[19:10:20] what's up milimetric
[19:10:41] and it stopped on Sep. 24th https://dumps.wikimedia.org/other/pagecounts-ez/merged/2020/2020-09/
[19:11:20] k gimme reminder context, is this an oozie job that transforms pageviews?
[19:11:28] oh right
[19:11:30] and archives to a file
[19:11:34] then something copies it over....
[19:12:08] milimetric: does the data exist in the source?
[19:12:10] in hdfs i guess?
[19:12:40] no, I think this is an Erik job on one of the stat boxes, probably stat1007?
[19:12:51] https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/profile/templates/dumps/distribution/datasets/rsyncd.conf.pagecounts_ez.erb
[19:12:54] https://github.com/wikimedia/puppet/blob/baf2a4e940d3d02304f9a8f868121f002d6db74e/modules/dumps/manifests/web/fetches/stat_dumps.pp
[19:13:16] # Defines rsync jobs that fetch various datasets from stat1007,
[19:13:16] # generated locally by ezachte.
[19:13:34] oh
[19:13:37] yes stat1007
[19:13:38] ok
[19:13:45] i didn't realize we still run the erik crons
[19:13:52] ok yeah, on stat1007 the data is not there
[19:13:56] /srv/dumps/pagecounts-ez/merged/2020/2020-09
[19:14:05] https://github.com/wikimedia/puppet/blob/f047f77f5eb2c5abf5f092cdf0d53bfa34730741/modules/profile/manifests/dumps/distribution/datasets/fetcher.pp#L25
[19:14:24] ya found that too
[19:14:25] right
[19:14:27] ok, do you know what makes this data?
[19:15:04] I thought it was a cron
[19:15:17] running as erik... but luca and I looked at it in the spring... maybe
[19:15:35] and we were going to migrate it, then realized we should just fix the whole dataset, which Fran just did
[19:15:57] doesn't look like it...
[19:16:02] i don't see any relevant cron or systemd timer
[19:16:15] i thought we somehow generated these from hadoop?
[19:16:23] i thought it was just the pageview data but in ez format?
[19:17:11] no idea really though
[19:19:05] ottomata: yeah, Fran is working on that, but that's the new approach, generating it in Hadoop as part of the normal pipeline. This old pipeline was starting somehow on stat1007, that's why the rsync is there
[19:19:15] so the rsync is still there, but maybe the crons got deleted on reimage?
[19:19:44] although that just happened, it wasn't on September 25th
[19:24:51] ottomata: basically, the reason this matters is that this is the main pageview dumps that people use, and we don't have a full replacement, and about four separate people have complained about it being gone
[19:25:12] I'm doing some searching on https://phabricator.wikimedia.org/T238243
[19:25:20] but it looks like the crons got deleted is my best guess
[19:26:21] this confirms it was a cron: https://phabricator.wikimedia.org/T238243#5742450
[19:26:27] "One is running fine (compressing page view counts into daily/monthly zips for 3rd parties)."
[19:30:23] :(
[19:30:24] "It is very likely they will break again when we move boxes/users/permits so in the absence of a full rewrite I also vote for turning them off."
[19:32:10] OH milimetric
[19:32:12] that makes sense
[19:32:15] they weren't puppetized at all
[19:32:18] right?
[19:32:21] right
[19:32:36] yikes
[19:32:38] yeah, but we knew that, just forgot
[19:32:41] uhhhh
[19:32:55] unless we can figure out somehow what command invocation was used to generate those....
[19:33:01] mayyyyyybe they are in bacula?
[19:33:14] I could swear Luca and I backed them up somewhere
[19:33:15] i will try to check
[19:33:30] don't worry ottomata, I'll send an email and maybe it'll jog Luca's memory
[19:35:59] okj
[19:37:30] milimetric: no backups of stat1007 on bacula
[19:37:38] thx
[19:44:40] (PS1) Mforns: Add Refine transform function for Netflow data set [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332)
[19:50:37] razzi: I added you to this code review ^ But it's optional, OK? This change needs a lot of context, that probably you still don't have, but I think it can be interesting for you to read. Maybe if/when you've read it, we can meet and discuss!
[19:53:48] mforns: Sounds good
[19:54:04] :]
[19:55:19] (CR) Mforns: [V: +2] "I tested this by refining existing Netflow data into my own database and data directory." [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[20:03:53] mforns: care to give me a demo of how code like that can be run?
[20:08:22] (CR) Ottomata: "One nit but LGTM!" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[20:09:47] razzi sure!
[20:09:51] wanna bc?
[20:09:54] ya
[20:09:57] k
[20:41:19] razzi, it's super choppy
[20:49:23] Analytics, Analytics-Kanban, Product-Analytics: Creation of canonical pageview dumps for users to download - https://phabricator.wikimedia.org/T251777 (kzimmerman)
[20:51:36] (CR) Razzi: "Could we add some tests?" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/634328 (https://phabricator.wikimedia.org/T254332) (owner: Mforns)
[20:58:00] Analytics, Event-Platform, Platform Team Workboards (Clinic Duty Team), Test-Coverage: EventBus tests are failing as they are hitting actual HTTP - https://phabricator.wikimedia.org/T265663 (Pchelolo) Actually, this is a more interesting issue. If EventBus extension is enabled while running tests...
[21:19:27] * razzi out for a walk
[23:13:22] Analytics, Analytics-Wikistats, Inuka-Team, Language-strategy, and 2 others: Have a way to show the most popular pages per country - https://phabricator.wikimedia.org/T207171 (10Nuria) Have in mind that per population data is not necessarily needed (it will be great to have at some point but it fee...