[01:23:16] (CR) DannyS712: [C: +1] "apparently I don't have +2 here..." [schemas/event/primary] - https://gerrit.wikimedia.org/r/656267 (owner: Mholloway)
[04:35:56] (CR) Joal: Update logic per Isaac (1 comment) [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/656166 (https://phabricator.wikimedia.org/T271571) (owner: Milimetric)
[04:59:27] Analytics, Analytics-Kanban, Pageviews-API: 404.php shows up in pageview API for 2017 - https://phabricator.wikimedia.org/T271870 (JAllemandou) Open→Declined It appears that the page-title `404.php` is a valid one as per our validation scheme. It is present in the pageview dataset on a regula...
[05:08:36] Analytics, Analytics-Kanban: Check home/HDFS leftovers of dcipoletti - https://phabricator.wikimedia.org/T271092 (JAllemandou) As part of the ops week I started this task. I updated the bash script in @elukey's link above to: * Not check `/var/userarchive/USERNAME.tar.bz2` on any host as I don't have sud...
[05:28:33] Analytics, Analytics-Kanban: Check home/HDFS leftovers of kaldari - https://phabricator.wikimedia.org/T271089 (JAllemandou) As part of the ops week I started this task. I updated the bash script in [[ https://wikitech.wikimedia.org/wiki/Analytics/Ops_week#Have_any_users_left_the_Foundation?]] to: * Check...
[06:56:49] bonjour
[07:00:15] joal: I think you left the sudo before 'ls' for /var/userarchive
[07:00:22] (but the change is good for grep)
[07:11:32] Analytics, Analytics-Kanban: Check home/HDFS leftovers of dcipoletti - https://phabricator.wikimedia.org/T271092 (elukey) @razzi what I usually do is clean up directories anyway, like `ssh stat1004.eqiad.wmnet 'rm -rfv /srv/home/dcipolletti'` etc.. (the -v lists files deleted). Please be very careful sin...
[07:11:46] Analytics, Analytics-Kanban: Check home/HDFS leftovers of kaldari - https://phabricator.wikimedia.org/T271089 (elukey) @razzi what I usually do is clean up directories anyway, like `ssh stat1004.eqiad.wmnet 'rm -rfv /srv/home/kaldari'` etc.. (the -v lists files deleted). Please be very careful since we d...
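A minimal sketch of the kind of leftover check being discussed above, assuming a hypothetical host list, paths, and username argument; this is not the actual ops-week script:

```bash
#!/bin/bash
# Check for home-directory and archive leftovers of a departed user.
# Host list and paths are illustrative assumptions.
USERNAME="$1"
for host in stat1004.eqiad.wmnet stat1005.eqiad.wmnet stat1008.eqiad.wmnet; do
    echo "== ${host} =="
    # Home directory leftovers (plain ls, no sudo needed)
    ssh "${host}" "ls -ld /srv/home/${USERNAME}" 2>/dev/null
    # Archived tarball, if any
    ssh "${host}" "ls -l /var/userarchive/${USERNAME}.tar.bz2" 2>/dev/null
done
# HDFS home directory leftovers (run from a host with an HDFS client)
hdfs dfs -ls "/user/${USERNAME}" 2>/dev/null
```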
[07:12:42] Analytics: Decide to move or not to PrestoSQL - https://phabricator.wikimedia.org/T266640 (elukey) Team decision to move to the latest upstream of PrestoDB for the moment, since we are lagging ~20 versions :(
[07:34:13] indeed elukey I forgot to remove the sudo :)
[07:39:53] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (JAllemandou)
[07:43:44] (PS1) Joal: Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560)
[07:48:23] (CR) Elukey: [C: +1] Change DataFrameToDruid base temporary path [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[08:04:26] (Abandoned) Elukey: Add some info about building with Docker [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/655087 (owner: Elukey)
[08:04:34] (Abandoned) Elukey: Drop Debian 9 Stretch support [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/655067 (owner: Elukey)
[08:04:39] (Abandoned) Elukey: Update some pypi dependencies to latest versions [analytics/jupyterhub/deploy] - https://gerrit.wikimedia.org/r/655088 (owner: Elukey)
[08:04:58] (Abandoned) Elukey: WIP - Add netflow/flowset schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/608077 (https://phabricator.wikimedia.org/T248865) (owner: Elukey)
[08:50:47] (PS1) Joal: Update oozie jobs tmp folders for ownership/perms [analytics/refinery] - https://gerrit.wikimedia.org/r/656373 (https://phabricator.wikimedia.org/T271560)
[08:52:41] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (JAllemandou)
[09:04:56] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:05:39] checking
[09:05:47] I guess that we are dropping datasources
[09:06:06] yep the drop ran 5 mins ago
[09:06:22] but so far it looks pretty good
[09:07:06] elukey: Shall we tweak the number/speed of moving segments while reshuffling?
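For reference on the reshuffling question above: the knob for how aggressively Druid rebalances segments is the coordinator's dynamic configuration, notably maxSegmentsToMove. A hedged sketch of inspecting and tweaking it, assuming an illustrative host name and the default coordinator port 8081:

```bash
# Read the coordinator's current dynamic config; maxSegmentsToMove bounds how
# many segments get rebalanced per coordinator run. Host name is illustrative.
curl -s http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/config | jq .

# POST replaces the whole dynamic config, so in practice fetch it first and
# post back the complete object with the tweaked value (value illustrative):
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"maxSegmentsToMove": 15}' \
  http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/config
```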
[09:09:52] joal: if we dropped 2 datasources or similar this time (I don't see it from the metrics but there might be some delay) then the timeouts that we added to AQS worked a lot
[09:10:00] no idea why only 1004 is complaining now
[09:10:08] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:10:13] :)
[09:10:18] ah there you go
[09:10:43] let's see how things go, and how many hosts are impacted this time
[09:12:34] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:15:06] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:15:26] so the metrics are not loading in wikistats
[09:16:00] the historicals are really stuck
[09:16:57] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (JAllemandou)
[09:16:59] :(
[09:17:24] elukey: possibly reducing the segment size (and therefore increasing the number of segments) was a bad idea?
[09:17:52] also elukey: maybe we can reduce the number of sources we keep to 3 instead of 6, it'd make the used-disk size smaller
[09:18:39] joal: but we'd have the drop problem anyway, no?
[09:18:57] elukey: I assume so yes :s
[09:19:12] elukey: maybe less data to reshuffle if less data is present, but still
[09:21:39] !log roll restart druid brokers on druid public - stuck after datasource drop
[09:21:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:22:28] joal: I think that the brokers are suffering from the connection pile-up, maybe the tighter timeout for aqs is also at fault
[09:22:41] I took a thread dump
[09:22:46] hm
[09:22:47] but the historicals' logs were ok
[09:22:56] the brokers' logs are full of timeouts
[09:23:09] now if I am right we should see recovery
[09:23:12] Ah - So a tighter timeout from AQS means faster fail, means faster retry?
[09:23:28] exactly yes, this is my theory
[09:23:34] makes sense
[09:23:47] the brokers take a ton of time to process the whole queue of requests
[09:23:54] wikistats works now
[09:24:00] so I expect icinga to agree
[09:24:17] I will follow up on druid users@
[09:25:04] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:25:04] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:25:18] hm
[09:29:34] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[09:38:30] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[10:03:22] elukey: Heya - I need some explanation on lookups vs hiera please :)
[10:03:58] sure
[10:04:29] elukey: When using the lookup syntax, where is the parameter retrieved from?
[10:05:57] joal: it is the same as hiera, but hiera() is deprecated
[10:06:06] it looks it up in hiera as well
[10:06:17] Ah - so stuff is stored in the hieradata folder, in yaml files
[10:06:24] ok
[10:06:40] It's basically an abstraction and/or indirection
[10:07:00] lookup *could* get stuff from elsewhere than hiera, but in our case, it (usually) doesn't
[10:07:10] makes sense
[10:09:20] thanks elukey and klausman
[10:09:21] joal: and the syntax changes, since to add default values you need to provide a map { 'default_value' => something }
[10:11:28] yup I have seen that
[10:12:27] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (JAllemandou)
[10:55:16] elukey: would you have a minute to test an archiva config change with me please?
[10:55:56] sure
[10:56:13] elukey: in relation to https://phabricator.wikimedia.org/T272082
[10:57:33] urgh
[10:59:22] joal: do you have a config to test? This seems like something to report upstream too
[10:59:42] elukey: I have something I'd like to test yes please
[11:00:22] joal: can you add what you want to test to the task?
[11:00:33] (back in a min)
[11:10:23] aaaand public transport in Zurich has been shut down entirely due to snow
[11:10:32] /o\
[11:12:21] joal: we need to test the change somewhere before applying it
[11:23:28] elukey: yeah that was my ask
[11:30:26] (we decided to test in wmcs/cloud later on)
[11:42:07] I am going to take a longer lunch break today, ttl!
[13:30:06] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Zache) >>! In T215858#6744188, @Jhernandez wrote: >>>! In T215858#6742843, @Zache wr...
[14:57:46] (CR) Elukey: "I misread the change, and I have a follow-up question.. why don't we use /tmp for this? I thought I had read /tmp/etc.., instead the dir i" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/656367 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[15:00:32] Gone for kids - back in ~2h
[15:05:30] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (elukey) Sorry for the extra question, going to follow up in here as well.. is there a reason not to use /tmp/etc..? I think that creating /tmp_e...
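To illustrate the lookup-vs-hiera exchange above (10:03-10:12): lookup() resolves keys from the hieradata YAML hierarchy just as hiera() did, and a default is supplied via the hash form. Key and node names below are hypothetical:

```bash
# In a manifest, the default goes in the hash form mentioned above:
#   lookup({ 'name' => 'profile::analytics::some_key', 'default_value' => 42 })
# From a puppetmaster, --explain traces which hieradata YAML file a key
# resolves from (key and node are made-up examples):
puppet lookup profile::analytics::some_key --node an-test1001.eqiad.wmnet --explain
```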
[16:28:22] (CR) Ottomata: [C: +2] Fix README typo [schemas/event/primary] - https://gerrit.wikimedia.org/r/656267 (owner: Mholloway)
[16:59:43] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (JAllemandou) >>! In T271560#6751422, @elukey wrote: > Sorry for the extra question, going to follow up in here as well.. is there a reason not t...
[17:02:11] mforns, milimetric - Would any of you be around and have a minute for me?
[17:02:54] elukey: I tried to answer your comment in the task - The choice of base-folder is definitely up in the air for now :)
[17:04:13] also elukey - thanks a lot for the patch upstream on archiva!
[17:04:51] joal: sure!
[17:04:59] Hi milimetric :)
[17:05:01] cave?
[17:05:07] 1 min
[17:05:14] np
[17:06:41] there joal
[17:09:22] joal: I changed your code change to https://gerrit.wikimedia.org/r/c/operations/debs/archiva/+/656448
[17:09:36] because I realized only afterwards that the file is part of the upstream release :(
[17:10:12] (we have the source of 2.2.4 in apt wikimedia, so we need a patch in this case)
[17:18:02] Analytics, Analytics-Kanban, Patch-For-Review: Follow up on hdfs:///tmp perms issues after umask change on HDFS - https://phabricator.wikimedia.org/T271560 (elukey) I think that a directory named /tmp_something is a little confusing, so anything under /tmp seems more precise to me, but I am ok if the...
[18:11:15] joal: let's sync before the weekend, ok?
[18:11:20] (if you have time)
[18:11:33] elukey: we're in da cave with mforns - You're welcome to join
[18:11:43] ah yes
[18:23:46] Analytics: Check data currently stored on thorium and drop what is not needed anymore - https://phabricator.wikimedia.org/T265971 (elukey) @JAllemandou moving the request to you due to the ops week :) Would you check with me that `/wmf/data/archive/backup/misc/thorium` contains all the data on `thorium:/...
[18:27:21] razzi: did you see https://gerrit.wikimedia.org/r/c/operations/puppet/+/654558 ?
[18:27:43] Brooke created a new role for us, we should check it and see if anything is missing
[18:27:58] (it will be used by the new version of labsdb1012)
[18:29:44] elukey: yeah, saw it, and have to do some reading on the context
[18:41:46] razzi: let's add a comment in the change so they are aware, with a timeline about when we'll review it
[18:42:15] the next step is to think about planning the labsdb1012 upgrade, since there are multiple things to keep in mind
[18:42:25] 1) needs to be done between sqoop runs (so within a month)
[18:42:35] 2) reloading data takes ~1 week, but it could be more
[18:43:04] so I'd say that we should aim to do the maintenance in the first week of Feb
[18:45:20] razzi: another good thing - there is a cookbook called sre.host.reboot-single, that is really handy for simple reboot use cases
[18:45:43] today I used it for some hosts, and also for the an-coord100x nodes (zookeeper)
[18:46:01] elukey: cool, I imagine for some small clusters it'd be easier to run that than to write a new cookbook
[18:46:07] I didn't create a cookbook for those since zookeeper in most cases is co-located, so rebooting is tricky
[18:46:10] yes exactly
[18:46:19] today I moved it to the class api so it logs the hostname in the SAL
[18:46:25] and it works really nicely
[18:46:31] cool
[18:47:04] there is also a code review for the new version of the hadoop reboot workers cookbook, once done we'll be able to reboot the nodes
[18:47:13] early next week should be good
[18:47:17] elukey: unrelated, I saw the aqs alerts from earlier - as I understand it, that's using the druid public cluster, so I don't think it would have anything to do with the druid analytics reboot I did yesterday, correct?
[18:47:18] (we can do it together)
[18:48:00] razzi: correct yes, it was the druid datasource drop timer that acted, and the brokers locked up.. I think the last changes in aqs didn't help as much as I hoped, I had to roll restart the brokers manually
[18:48:08] didn't answer in alerts@, doing it now
[18:48:28] basically the historicals lock up for a bit when datasources are dropped, and the brokers pile up connections
[18:48:53] I will try to follow up in druid users@ (every Apache project has a users@ mailing list)
[18:55:19] leaving for dinner, will check later on if anything is needed :)
[18:56:31] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Jhernandez) >>! In T215858#6751253, @Zache wrote: > It would help to see what the op...
[19:13:42] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[19:33:30] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) >>! In T260445#6729322, @Cmjohnson wrote: > @robh can you complete the off-site work for an-worker1118-1138. Still needs dhcpd file updated and may...
[19:35:41] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) Ok, it was set to a temp, easy to type password via crash cart for the initial racking. I got the temp password from Chris, so I'll just update the...
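A hedged sketch of the single-host reboot cookbook mentioned above (18:45), run from a cluster-management (cumin) host; the cookbook name is assumed to be spelled sre.hosts.reboot-single and its arguments may differ:

```bash
# List the available cookbooks to confirm the exact name, then reboot one
# host; the cookbook handles downtime and logs to the SAL automatically.
sudo cookbook -l | grep -i reboot
sudo cookbook sre.hosts.reboot-single an-coord1001.eqiad.wmnet
```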
[19:38:12] Analytics, Event-Platform: Some refined events folders contain no data while they should - https://phabricator.wikimedia.org/T272177 (JAllemandou)
[19:41:35] Analytics, Event-Platform: Some refined events folders contain no data while they should - https://phabricator.wikimedia.org/T272177 (JAllemandou)
[19:41:43] mforns: can you please proofread that ticket? --^
[19:41:52] lookin!
[19:43:20] joal: LGTM!!
[19:43:24] \o/
[19:43:32] joal: one question
[19:43:36] sure
[19:44:00] for Presto, if the data size is smallish
[19:44:12] but the number of rows is big,
[19:44:21] does that affect performance?
[19:44:46] Intuitively I'd think yes
[19:45:22] I was thinking, regarding session length data, it is small, but it is long...
[19:45:38] The more rows, the more computation for them, but here we're fine I think
[19:46:05] Will Presto be able to compute queries on top of billions of rows?
[19:46:19] queries=percentiles
[19:46:41] mforns: with the approx function, I assume so - I have not tested it though
[19:46:49] aha, ok
[19:46:57] will test also :]
[19:47:04] thanksss
[19:47:24] np mforns :) Leaving for now - have a goodend folks
[19:47:37] bye joal nice weekeneddd!
[19:57:57] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[19:59:17] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) Please note the correct IPMI password has been set for all of these hosts EXCEPT THE FOLLOWING as they are NOT online or racked: an-worker11[29,33,34...
[22:33:29] Analytics, Data-Services, cloud-services-team (Kanban): Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema - https://phabricator.wikimedia.org/T215858 (Ladsgroup) >>! In T215858#6751253, @Zache wrote: > Do you have any analysis on this?...
[22:44:11] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` Th...
[22:46:44] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[22:58:19] Analytics-Clusters, DC-Ops, SRE, ops-eqiad, Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1118....
[23:24:26] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH)
[23:31:12] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in...
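On the Presto percentile question above (19:44-19:46): Presto's approx_percentile aggregate computes quantile estimates in bounded memory, which is what should make percentiles over billions of rows feasible. A minimal sketch from the Presto CLI, with hypothetical catalog, schema, table, and column names:

```bash
# All names below are stand-ins for the session length dataset;
# approx_percentile takes an array of quantiles and returns one estimate each.
presto --catalog analytics_hive --schema event --execute "
  SELECT approx_percentile(session_length, ARRAY[0.5, 0.95, 0.99])
  FROM session_length
  WHERE year = 2021 AND month = 1"
```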
[23:41:54] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1118.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1118.eqiad.wmnet'] ` [23:46:08] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` an-worker1118.eqiad.wmnet ` The log can be found in...