[05:04:05] Analytics, Analytics-EventLogging: Consider how to best architect transmission of events - https://phabricator.wikimedia.org/T240454 (Aklapper) [Please make sure that open tasks have active project tags, so these tasks can be found when looking at workboard - thanks a lot!]
[07:10:20] Analytics, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo)
[07:14:29] Analytics, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo) This is not the first time it happens, and seems specific to analytics dbs: T270112
[07:14:49] Analytics, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (jcrespo)
[07:14:51] Analytics-Radar, DBA: mariadb on dbstore hosts, and specifically dbstore1004, possible memory leaking - https://phabricator.wikimedia.org/T270112 (jcrespo)
[07:38:27] Analytics: Kerberos identity for kevinbazira - https://phabricator.wikimedia.org/T290843 (kevinbazira)
[08:00:10] Analytics: Kerberos identity for kevinbazira - https://phabricator.wikimedia.org/T290843 (kevinbazira)
[08:23:28] Analytics: Kerberos identity for kevinbazira - https://phabricator.wikimedia.org/T290843 (elukey) Open→Resolved a:elukey Followed up with Kevin, the krb flag is already present, credentials working, all good!
[11:58:57] Analytics, Data-Engineering, Patch-For-Review: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (Marostegui) I have merged the above patch to decrease mysql buffer pool sizes for all the instances. This requires mys...
[12:49:02] (PS5) Michael DiPietro: add stop status [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349)
[13:42:26] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (BTullis) This might be a daft question, given the work that has already gone into the chang...
[13:42:58] Quarry, cloud-services-team (Kanban): Quarry returns 500 rather than 404 when asked for an invalid quarry ID - https://phabricator.wikimedia.org/T290874 (Andrew)
[13:58:51] Analytics-Clusters, Analytics-Kanban, Patch-For-Review, User-MoritzMuehlenhoff: Improve user experience for Kerberos by creating automatic token renewal service - https://phabricator.wikimedia.org/T268985 (elukey) I think it is good to have something that needs to auto-renew periodically after 2d...
[14:10:12] (PS11) Andrew Bogott: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353
[14:14:02] (PS1) Ladsgroup: Fix file permission of recentchanges tags [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720762 (https://phabricator.wikimedia.org/T236893)
[14:17:07] (CR) Ladsgroup: "Adding Lucas as an expert in this issue." [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720762 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:29:12] (PS6) Michael DiPietro: add stop status [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349)
[14:32:44] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix file permission of recentchanges tags (1 comment) [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720762 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:34:01] (CR) Lucas Werkmeister (WMDE): [C: +2] "Maybe we should just change the cron scripts to run the commands with `php`, so they don’t all need to be executable. (Also, why do the sc" [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720762 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:34:23] (Merged) jenkins-bot: Fix file permission of recentchanges tags [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720762 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:35:52] (PS1) Ladsgroup: Fix file permission of recentchanges tags [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720697 (https://phabricator.wikimedia.org/T236893)
[14:35:57] (CR) Ladsgroup: [C: +2] Fix file permission of recentchanges tags [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720697 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:36:55] (Merged) jenkins-bot: Fix file permission of recentchanges tags [analytics/wmde/scripts] - https://gerrit.wikimedia.org/r/720697 (https://phabricator.wikimedia.org/T236893) (owner: Ladsgroup)
[14:48:17] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (mforns) @Michael Hi! I'm going to migrate this schema during the next couple weeks. I need to...
[14:59:09] Analytics, Analytics-EventLogging, Event-Platform, Wikidata, and 3 others: Migrate WikibaseTermboxInteraction EventLogging Schema to new EventPlatform thingy - https://phabricator.wikimedia.org/T290303 (Michael) >>! In T290303#7348565, @mforns wrote: > @Michael Hi! > > I'm going to migrate this...
[15:02:50] btullis: standup?
[15:03:13] oops sorry btullis - disregard
[15:14:54] (CR) Michael DiPietro: add stop status (1 comment) [analytics/quarry/web] - https://gerrit.wikimedia.org/r/719567 (https://phabricator.wikimedia.org/T289349) (owner: Michael DiPietro)
[15:46:37] Analytics, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (odimitrijevic) p:Triage→High a:razzi
[15:49:10] (CR) Milimetric: [C: +2] "I think it's a good experiment, so I'm for it. Feel free to verify and merge whenever you want to try it." [analytics/refinery] - https://gerrit.wikimedia.org/r/720317 (https://phabricator.wikimedia.org/T290723) (owner: Joal)
[15:50:13] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (odimitrijevic) p:Triage→Medium a:razzi
[15:51:55] Analytics, Analytics-Kanban, Patch-For-Review: Investigate why gobblin pulls webrequest data late - https://phabricator.wikimedia.org/T290723 (odimitrijevic) p:Triage→High
[15:52:05] Analytics: Check home/HDFS leftovers of jmads - https://phabricator.wikimedia.org/T290715 (odimitrijevic) p:Triage→High
[16:17:41] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (elukey) ` elukey@an-worker1096:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components in optimal state) name: Adapter #0 Virtual Drive: 6 (Target...
[17:37:08] Update on dbstore1007: it's still hovering around 96% memory usage, and I'm chatting with jynus on #wikimedia-data-persistence about rebooting mariadb to see if that improves things
[17:37:50] I'm still not sure who / what is putting so much memory pressure on
[17:38:14] razzi: I'm interested if we h
[17:38:31] razzi: I'm interested if we have some kind of query logs for instance maybe? (sorry for the wrong message)
[17:39:11] I'm going to try `show processlist` to see what is happening
[17:39:44] just looking up the command to connect to a specific db section. There's 3, so it'll be a good start to figure out what section(s) are taking traffic
[17:41:23] Thanks razzi for looking into this
[17:41:28] Gone for today team
[17:41:44] cya joal !
[17:42:17] s2 looks chill
[17:43:05] Hm, all sections actually look normal from a `show processlist;` perspective
[17:43:18] but then again I don't know exactly what I'm looking at
[17:43:43] But the only "Query" command is the `show processlist` I just ran...
[17:44:11] So hopefully restarting the mysql process will free up the memory, and nothing is actually actively using it?
[17:53:03] Analytics, SRE, ops-eqiad: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (Cmjohnson) A new disk has been ordered and will be here this week. You have successfully submitted request SR1070175430.
[18:05:13] !log sudo systemctl restart mariadb@s2.service
[18:05:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:01] !log razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
[18:13:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:13:04] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:17:47] Analytics, Analytics-Kanban: Fix `wmf.editors_daily` data deletion - https://phabricator.wikimedia.org/T290093 (razzi) Merged the patch for this, thinking of letting auto puppet run take care of it, let me know if there's some manual step I missed.
[18:19:29] !log razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841
[18:19:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:19:32] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:24:57] !log razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841
[18:25:00] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:25:00] T290841: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841
[18:25:12] !log (I stopped replication earlier but forgot to !log)
[18:25:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:32:26] Analytics, Data-Engineering: dbstore1007 is swapping heavilly, potentially soon killing mysql services due to OOM error - https://phabricator.wikimedia.org/T290841 (razzi) Restarting the 3 mysqld sections put the memory into a reasonable 14% usage. It's possible there's something leaking memory however a...
[18:33:29] Restarted the mysqld processes on dbstore1007 and memory is under control
[18:33:55] it's still rising, and I expect it'll continue to rise for a while, but hopefully it'll hover at a reasonable (<80%) threshold eventually
[18:34:27] jynus suggested a user query / stored procedure could be leaking memory; that could use some investigation
[18:52:39] Analytics, Data-Engineering, Growth-Team, Metrics-Platform, and 4 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (nettrom_WMF) Hi all, coming back to this as I've been OoO. As mentioned in T288853#7288436, having...
[19:00:30] Quarry, cloud-services-team (Kanban): Quarry returns 500 rather than 404 when asked for an invalid query ID - https://phabricator.wikimedia.org/T290874 (Andrew)
[19:18:09] PROBLEM - Hadoop NodeManager on an-worker1096 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:42:57] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (fkaelin) Submitting a spark job from an airflow instance results in a hadoop/hdfs permission error `Access...
[20:39:28] Analytics: hdfs directory for analytics-research - https://phabricator.wikimedia.org/T290918 (Milimetric)
[20:40:03] Analytics, Analytics-Kanban, Platform Engineering, Research, Patch-For-Review: Create airflow instances for Platform Engineering and Research - https://phabricator.wikimedia.org/T284225 (Milimetric) I'm a little fuzzy here but I do know this is because there's no `/user/analytics-research` di...
[20:48:40] * razzi breaking to take a nap
[21:46:33] (PS12) Andrew Bogott: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353
[21:46:41] (CR) jerkins-bot: [V: -1] test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353 (owner: Andrew Bogott)
[21:55:41] (PS13) Andrew Bogott: test query routes [analytics/quarry/web] - https://gerrit.wikimedia.org/r/720353