[02:13:55] Analytics: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (Tbayer) This is affecting the monthly core readership metrics we are currently preparing for the board and the mw.org Audiences page, so a quick initial assessment (to see whether/how this could be f...
[06:36:36] Analytics, Contributors-Analysis, Product-Analytics, Epic, User-Elukey: Support all Product Analytics data needs in the Data Lake - https://phabricator.wikimedia.org/T212172 (Neil_P._Quinn_WMF)
[06:41:37] Analytics, Contributors-Analysis, Product-Analytics, Epic, User-Elukey: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (Neil_P._Quinn_WMF)
[06:43:53] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (Marostegui)
[06:46:39] Analytics, Contributors-Analysis, Product-Analytics, Epic, User-Elukey: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (Neil_P._Quinn_WMF)
[06:49:25] Analytics, Contributors-Analysis, Product-Analytics, Epic, User-Elukey: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (Neil_P._Quinn_WMF)
[07:43:46] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (Marostegui) I have repaired and analyzed all the affected tables reported, but replication is still complaining about it. Also, I can do a select on the tables without any problem. ` zhwiki +---------------+---------+----------+-----...
[07:43:53] helloooo
[07:44:55] morning!
[07:45:20] elukey: let's deploy superset at some point this morning?
[07:45:34] sure, atm dbstore1002 is having issues, Manuel is debugging it,,
[07:45:37] let's wait a bit
[07:45:41] yessir
[07:51:12] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (elukey) Executed `bmc-device --debug --cold-reset` in localhost since the mgmt interface was not available ("No more sessions available").
[07:53:39] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (elukey) `racadm getsel` for today (remember that one disk was already failed, we have a task about it): ` ------------------------------------------------------------------------------- Record: 7 Date/Time: 01/14/2019 07:48:1...
[07:53:45] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (Marostegui) I wanted to drop and rebuild the tables but I cannot even drop it: ` root@dbstore1002.eqiad.wmnet[eswiki]> drop table linter; ERROR 1030 (HY000): Got error 1 "Operation not permitted" from storage engine Aria root@dbstor...
[07:57:37] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (Marostegui) This is the degraded RAID task T206965
[08:12:31] Analytics: dbstore1002 crashed - https://phabricator.wikimedia.org/T213670 (elukey)
[08:13:14] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (elukey)
[08:49:25] a-team: heads up - mysql on dbstore1002 down due to maintenance
[09:08:55] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[09:24:38] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) A mysql stop didn't work, neither a kill to mysqld, I had to do kill -9 to mysqld so mysqld_safe has restarted the process. It is starting up now...I will report back ` 190114 09:23:00 mysqld_safe Number of processe...
[09:33:04] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) MySQL is failing to start: ` 190114 9:32:14 [ERROR] Aria engine is not enabled or did not start. The Aria engine must be enabled to continue as mysqld was configured with --with-aria-tmp-tables 190114 9:32:14 [ERRO...
[09:45:41] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1028.eqiad.wmnet', 'ana...
[09:46:31] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[09:52:38] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) p:Triage→High a:Marostegui
[10:24:18] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1031.eqiad.wmnet', 'ana...
[10:26:21] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[10:55:10] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (elukey) Update for all the users of dbstore1002: We are still not sure what happened to dbstore1002, but now mysql does not start due to some issues with Aria based tables. A lot of them (from 2014->2017) are in the `staging` d...
[10:57:11] fdans: sorry I haven't forgot, but dbstore1002 is still down :(
[10:57:25] hopefully we'll do the deploy this afternoon ok?
[10:57:28] i know i know, no rush
[10:57:32] of course!
[10:57:37] let's also make sure to check the status of the current dashboards
[10:58:02] so we'll be able to check straight away which ones are broken
[10:58:04] hopefully 0
[11:01:34] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1032.eqiad.wmnet'] ` O...
[11:08:18] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[11:37:46] * elukey lunch!
[11:38:22] (the dbstore1002's recovery process is still in progress, we don't have a clear ETA atm. The task has been updated and I have alerted people via email)
[11:44:45] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1032.eqiad.wmnet'] ` a...
[12:58:46] Hi team - I'm not feeling well today - Spent the morning in bed, and will probably go back in a bit - Hopefully up tomorrow :S
[13:11:32] Analytics, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (leila) Cervisiarius brought this task up last week and mentioned that he and a master student of his are working on: 1. putting together the system for generating HTML dumps, 2. they will r...
[13:18:58] Analytics, Datasets-Archiving, Research: Make HTML dumps available - https://phabricator.wikimedia.org/T182351 (ArielGlenn) These are full html of the pages or 'just' of the parsed/rendered wikitext, or...? And, is there a notion of what the code looks like or what components are involved, so we can...
[13:51:15] * elukey sends wikilove to joal
[13:58:46] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[14:03:15] Analytics: Check home leftovers of user imarlier (Ian Marlier) - https://phabricator.wikimedia.org/T213702 (elukey) p:Triage→Normal
[14:08:39] fdans: ---^
[14:08:58] did we decide about a special tag for who is on ops week?
[14:09:01] I don't recall
[14:09:18] ah no a column on the analytics board
[14:09:20] elukey: I think the decision was to create a new column
[14:09:43] elukey: we still have that other task, we didn't delete those, did we?
[14:09:52] we did no
[14:09:54] *not
[14:09:57] (nithum and jamesur)
[14:10:00] there is a ops week col
[14:10:06] but not that task
[14:11:00] ah it was in ops excellence
[14:11:04] all right moving it to the new col
[14:11:07] together with the new task
[14:11:39] very nice
[14:21:25] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) Fixing all aria tables has finished. MySQL has started correctly and same for replication. The host is now lagging behind as it has been down for a while but it will eventually catch up: ` root@cumin1001:~# mysql.py...
[14:34:59] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1033.eqiad.wmnet', 'ana...
[14:44:27] Analytics: Convert Aria tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 (Marostegui) p:Triage→Normal
[14:47:23] Analytics, User-Elukey: Convert Aria tables to InnoDB on dbstore1002 - https://phabricator.wikimedia.org/T213706 (elukey)
[14:49:42] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (elukey) Update for all the users of dbstore1002: mysql is again up after some hours of downtime, usable but still catching up with replication.
[14:50:34] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[14:59:02] hi everyone. Office hours starting in 1 minute
[14:59:12] WELCOME!!!
[14:59:16] and ask us anything
[14:59:26] we have some thoughts on here: https://wikitech.wikimedia.org/wiki/Analytics/Team/Office_Hours
[15:02:56] :) office feels empty
[15:05:51] Hello hello
[15:06:01] hi GoranSM !!
[15:06:06] And I thought we were about to have a Hangouts session
[15:06:11] welcome, you are the first ever guest of the first ever office hours
[15:06:16] milimetric: Hi Dan how are you
[15:06:20] milimetric: HAHAHHAHH
[15:06:23] GoranSM: we can switch to hangouts if needed, just starting in IRC to give everyone an easy chance
[15:06:35] milimetric: Got it
[15:09:42] heya teamm
[15:10:11] hola Marcel
[15:10:22] heyyy
[15:23:16] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['stat1005.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-au...
[15:26:22] !log reimage stat1005 - T205846
[15:26:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:26:25] T205846: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846
[15:28:25] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1036.eqiad.wmnet', 'ana...
[15:28:56] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmn...
[15:29:11] * elukey afk for 10 mins
[15:51:53] Analytics, Operations, ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (Marostegui) >>! In T206965#4694827, @Cmjohnson wrote: > @elukey dbstore1002 is out of warranty and has 1.2T disks. I don't have disks this size but can replace with a 2TB disk.. Let's do it Th...
[15:52:43] Analytics, Operations, ops-eqiad: Degraded RAID on dbstore1002 - https://phabricator.wikimedia.org/T206965 (elukey) a:Cmjohnson
[15:59:40] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['stat1005.eqiad.wmnet'] ` and were **ALL** successful.
[16:01:29] ping elukey milimetric joal
[16:01:36] Analytics, Analytics-Kanban, Patch-For-Review: Move users from stat1005 to stat1007 - https://phabricator.wikimedia.org/T205846 (elukey)
[16:01:53] Analytics, Contributors-Analysis, Product-Analytics, Epic, User-Elukey: Provide feature parity between the wiki replicas and the Analytics Data Lake - https://phabricator.wikimedia.org/T212172 (Tbayer) It seems we have collected enough use cases already to facilitate the present discussion, b...
[16:02:01] ouch comingggg
[16:02:10] (PS1) Mforns: Make saltrotate store salts with timestamps as file name. [analytics/refinery] - https://gerrit.wikimedia.org/r/484250 (https://phabricator.wikimedia.org/T212014)
[16:18:33] would anyone from the a-team be able to take a quick initial look at https://phabricator.wikimedia.org/T213602 today? it's blocking our monthly core metrics report to the board
[16:18:51] HaeB: we are discussing it now :)
[16:18:51] HaeB: we're looking at it this very second
[16:19:01] great, thanks :)
[16:19:15] thanks for the report, we're missing an alarm here, we should've seen this
[16:20:51] Analytics: Alarms for virtualpageview should exist (probably in oozie) for jobs that have been idle too long - https://phabricator.wikimedia.org/T213716 (Nuria)
[16:22:04] Analytics, Analytics-EventLogging, EventBus, Security-Team, and 3 others: Modern Event Platform: Stream Intake Service: AJV usage security review - https://phabricator.wikimedia.org/T208251 (sbassett) @Ottomata - The #security-team should be able to get a review scheduled for this soon. Just a...
[16:22:51] Analytics: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (Nuria) There is no gap on incoming data: https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&from=now-34d&to=now
[16:22:53] Analytics: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (fdans) p:Triage→Unbreak! a:Ottomata
[16:24:33] Analytics: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (Nuria) @Tbayer we will be working on seeing why this job stopped (it is missing a few hours arround september 17th) in the meantime to approximate data if this is urgent you can use eventlogging requ...
[16:26:49] Analytics, Anti-Harassment, Product-Analytics: Distinguish between types of block events in the Mediawiki user history table - https://phabricator.wikimedia.org/T213583 (fdans) We'll take the ipblocks work after the addition of the tags to the data lake.
[16:27:03] Analytics, Anti-Harassment, Product-Analytics: Distinguish between types of block events in the Mediawiki user history table - https://phabricator.wikimedia.org/T213583 (fdans) p:Triage→Normal
[16:28:14] Analytics, Analytics-Kanban, Patch-For-Review: Update big spark jobs conf with better settings - https://phabricator.wikimedia.org/T213525 (fdans) p:Triage→High
[16:28:47] Analytics, Analytics-Kanban, Patch-For-Review: Add 'mediawiki_history_unchecked' dataset to oozie - https://phabricator.wikimedia.org/T213524 (fdans) p:Triage→Normal
[16:30:11] Analytics, Analytics-Kanban, Patch-For-Review: Set up a Analytics Hadoop test cluster in production that runs a configuration as close as possible to the current one. - https://phabricator.wikimedia.org/T212256 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['analytics1039.eqiad.wmnet'] ` O...
[16:33:47] Analytics, Scoring-platform-team: Investigate formal test framework for Oozie jobs - https://phabricator.wikimedia.org/T213496 (fdans) p:Triage→Low
[16:36:24] Analytics, Product-Analytics: Superset's rolling average feature results in error message - https://phabricator.wikimedia.org/T213488 (fdans) We're planning on upgrading superset to the latest release once the project is a bit more stable, which won't happen at least until after the all hands. Hopefully...
[16:36:49] Analytics, Product-Analytics: Superset's rolling average feature results in error message - https://phabricator.wikimedia.org/T213488 (fdans) p:Triage→High
[16:37:42] Analytics, Analytics-Kanban, Chinese-Sites, Patch-For-Review: Add Chinese Wikiversity edit-related metrics to Wikistats 2 - https://phabricator.wikimedia.org/T213290 (fdans) p:Triage→High
[16:38:59] Analytics, MediaWiki-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (fdans) We only release pageview per country aggregated per project, not per article: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageviews_spl...
[16:39:14] Analytics, MediaWiki-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (Milimetric) Example: https://wikimedia.org/api/rest_v1/metrics/pageviews/top-by-country/ro.wikipedia.org/all-access/2018/12
[16:40:21] !log running refine eventlogging analytics for dec 17 2018 12:00 - 16:00 - T213602
[16:40:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:40:24] T213602: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602
[16:42:25] Analytics, MediaWiki-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (fdans) This is the study we made back when we developed the pageviews per country API: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews/Pageviews_b...
[16:42:56] Analytics, MediaWiki-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (fdans) p:Triage→Low
[16:43:53] Analytics, MediaWiki-API, PageViewInfo, Pageviews-API: API Analytics - page views by country - https://phabricator.wikimedia.org/T213221 (fdans) And this is the task for the ongoing effort to expose per article country data. https://phabricator.wikimedia.org/T189339
[16:44:32] Analytics, Analytics-Dashiki, Analytics-Kanban, Patch-For-Review: Improve Dashiki defaults for Browser selection - https://phabricator.wikimedia.org/T213215 (fdans) Open→Resolved
[16:47:09] Analytics: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (fdans) let's figure out whether we need to back up the staging database.
[16:47:39] Analytics, Analytics-Kanban: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (fdans)
[16:50:42] Analytics: Reportupdater should alert if it fails over and over - https://phabricator.wikimedia.org/T213309 (fdans) a:Milimetric
[16:55:46] !log restart turnilo to pick up new changes
[16:55:47] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:55:49] nuria: --^
[16:55:51] all ready
[16:57:53] Analytics: Alarms for virtualpageview should exist (probably in oozie) for jobs that have been idle too long - https://phabricator.wikimedia.org/T213716 (fdans) p:Triage→High
[16:59:07] elukey: remember to get the dates for SRE offsite plis
[17:01:02] ah yes hopefully we know today
[17:01:37] Analytics, Analytics-Kanban: Reportupdater queries jobs failing - https://phabricator.wikimedia.org/T213219 (fdans)
[17:02:35] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (Nuria)
[17:04:27] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (fdans) p:Triage→Normal
[17:04:54] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (Nuria) a:Nuria
[17:14:07] Analytics, Analytics-Kanban: dbstore1002 Mysql errors - https://phabricator.wikimedia.org/T213670 (Marostegui) Open→Resolved I am closing this as replication keeps catching up nicely without any errors ` root@cumin1001:/home/marostegui# mysql.py -hdbstore1002 -e "show all slaves status\G" | egrep...
[17:19:34] Analytics: virtualpageview_hourly lacks data from December 17 on - https://phabricator.wikimedia.org/T213602 (Ottomata) Thanks for filing Tilman, I'm refining this data now, and Oozie is scheduling the jobs now: https://hue.wikimedia.org/oozie/list_oozie_coordinator/0001715-180905070129339-oozie-oozi-C/ Thi...
[17:31:33] musikanimal: yt?
[17:32:40] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (Nuria) Used this url: https://tools.wmflabs.org/grantmetrics/
[17:33:14] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (Nuria)
[17:37:07] Analytics: Add site to piwik,wikimedia.org for grant metrics so we can measure traffic to tool - https://phabricator.wikimedia.org/T213735 (Nuria) Tracking code: