HC1: Operations SIG 26 Aug 2024

From Zenon Wiki
Revision as of 02:19, 30 August 2024 by 0x3639

Agenda

What: Meeting to Discuss Improving Node Operations as part of the HC1: Operations SIG

When: 26 Aug 2024 @ 6PM EST

Where: https://element.zenon.chat/#/room/#sig-operations:hc1.chat

Chair: 0x3639

Agenda:

  1. Discuss follow-up items from previous meeting
  2. Document action items
  3. Establish next meeting

If you want to attend, please respond (or DM) with your full Matrix username and I will invite you to the group. No FUD, anger or BS allowed.

Pre-meeting Notes

0x3639

  • Added `grafana.sh` (https://github.com/go-zenon/go/blob/main/grafana.sh). It automates the installation of Grafana, node_exporter, and Prometheus. It creates a default Prometheus datasource, scrapes the node_exporter endpoint, and installs a node_exporter dashboard. Tested on amd64; arm64 support still needs to be added.
  • Started to investigate a custom dashboard for znnd. I previously created one for Docker that leveraged the JSON API data source for Grafana. However, that plugin is now in maintenance mode and no new features will be added. Grafana recommends using the Infinity data source plugin instead.
  • I started to investigate the Infinity data source plugin. It will be used to scrape the API endpoints to report `syncStatus` and other important metrics.
  • Next we can consider installing Loki to manage log files. We can discuss at the meeting.
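
As a rough sketch of the kind of request the Infinity data source would be polling, the snippet below builds a JSON-RPC 2.0 request body and posts it with curl. The `stats.osInfo` method name comes from the metrics list in the minutes below. The endpoint `http://127.0.0.1:35997` and the helper names `rpc_body` and `rpc_call` are assumptions for illustration and should be checked against the node's actual configuration.

```shell
#!/bin/sh
# Hypothetical helpers: rpc_body and rpc_call are illustrative names,
# not part of go-zenon or grafana.sh.

# Assumed local HTTP RPC endpoint; adjust to match the node's config.json.
ZNND_RPC_URL="${ZNND_RPC_URL:-http://127.0.0.1:35997}"

# Print a JSON-RPC 2.0 request body for a method that takes no parameters.
rpc_body() {
    printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":[]}' "$1"
}

# POST the request to the node and print the raw JSON response.
rpc_call() {
    curl -s -X POST -H 'Content-Type: application/json' \
        --data "$(rpc_body "$1")" "$ZNND_RPC_URL"
}
```

Against a running node, `rpc_call stats.osInfo` would print the raw response, which is the same JSON a dashboard panel would have to parse.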

George

Coinselor

  • Made ASCII Art more readable at lower resolutions.
  • Added --help flag https://github.com/go-zenon/go/pull/8
  • I can test arm64 support; I will be spawning an ARM VPS for the Supernova testnet.
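
One possible way to approach the arm64 support mentioned above is to map the output of `uname -m` onto the amd64/arm64 labels that release archives commonly use. This is only a sketch: `map_arch` is a hypothetical helper name, and how `grafana.sh` actually selects its downloads is not described in these notes.

```shell
#!/bin/sh
# Hypothetical helper: map_arch is an illustrative name, not taken from grafana.sh.

# Translate a `uname -m` machine name into an amd64/arm64 label.
# Prints an error to stderr and returns 1 for anything else.
map_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)             echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}
```

A script could then call `ARCH="$(map_arch "$(uname -m)")" || exit 1` once and reuse `$ARCH` when choosing which archive to download.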

Minutes

Mon, Aug 26, 2024, 17:03:11 - deeznnutz: Thx everyone for contributing to the go-zenon bash script. We are making good progress.

Mon, Aug 26, 2024, 17:03:21 - deeznnutz: I merged in @coinselor's PR #8 to improve the ASCII art and add a --help flag.

Mon, Aug 26, 2024, 17:03:34 - deeznnutz: those changes were pretty straight forward

Mon, Aug 26, 2024, 17:03:42 - deeznnutz: George submitted the PR for arm64 support. I have not tested it yet. Once we test it we can pull in that change. It's pretty simple. He submitted an issue to make sure the script checks for apt and systemd. Should we clarify that as a requirement or have the script check for the proper operating system and systemd?

Mon, Aug 26, 2024, 17:04:05 - georgezgeorgez: I think it's fine just to document it for now.

Mon, Aug 26, 2024, 17:04:15 - georgezgeorgez: I think it's okay for us to do 1 deployment target really well first.

Mon, Aug 26, 2024, 17:04:45 - georgezgeorgez: The people who need the most support will probably be choosing ubuntu/deb as their recommended OS.

Mon, Aug 26, 2024, 17:04:51 - deeznnutz: ya, makes sense. Should we check in the script and halt it if apt and systemd are not present?

Mon, Aug 26, 2024, 17:05:25 - georgezgeorgez: We could do that, but not a priority.

Mon, Aug 26, 2024, 17:05:46 - deeznnutz: OK - I can add that as a todo and we can deal with it later.
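
A minimal sketch of the deferred check discussed above, assuming that finding `apt-get` and `systemctl` on the PATH is a good enough signal for "apt and systemd are present". The function names `missing_commands` and `require_apt_and_systemd` are hypothetical and not part of the go-zenon script.

```shell
#!/bin/sh
# Hypothetical helpers: these names are illustrative, not from the go-zenon script.

# Print each command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Return 1 with a message when apt or systemd tooling is absent.
# Assumption: apt-get and systemctl on PATH stand in for apt and systemd.
require_apt_and_systemd() {
    missing="$(missing_commands apt-get systemctl)"
    if [ -n "$missing" ]; then
        echo "Unsupported system, missing: $missing" >&2
        return 1
    fi
}
```

Calling `require_apt_and_systemd || exit 1` at the top of the installer would halt before any packages are touched.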

Mon, Aug 26, 2024, 17:05:47 - georgezgeorgez: We should try and get someone to use this script in the wild asap and get information about their nodes via the monitoring

Mon, Aug 26, 2024, 17:06:03 - deeznnutz: I set up a standalone script to automate the installation of Grafana, node_exporter & Prometheus. It creates a default Prometheus datasource, scrapes the node_exporter endpoint, and installs a default node_exporter dashboard. It currently only works on amd64.

Mon, Aug 26, 2024, 17:06:21 - deeznnutz: TODO

  • We need to expand functionality to arm64
  • Add a custom dashboard for znnd. This will require installing the Infinity data plugin and adding a new datasource (you can add an API endpoint as a datasource and it scrapes the API at x interval).
  • Potential znnd metrics to show:
    • Sync status
    • currentHeight
    • targetHeight
    • version
    • commit
    • numPeers
    • stats.osInfo
    • What else should we include?

Mon, Aug 26, 2024, 17:06:35 - georgezgeorgez: What is the infinity data plugin?

Mon, Aug 26, 2024, 17:06:45 - deeznnutz: it's a plugin that allows curl calls

Mon, Aug 26, 2024, 17:07:03 - deeznnutz: it basically runs them on a schedule and then you can display the data in a dashboard

Mon, Aug 26, 2024, 17:07:22 - georgezgeorgez: gotcha. That might be the fastest way

Mon, Aug 26, 2024, 17:07:34 - georgezgeorgez: There could be other relatively quick methods like parsing logs

Mon, Aug 26, 2024, 17:07:54 - deeznnutz: previously I used JSON API and it worked great. but that plugin is no longer under development

Mon, Aug 26, 2024, 17:08:04 - coinselor: I think syrius shows quite a few znnd metrics, we could use that as reference

Mon, Aug 26, 2024, 17:08:49 - georgezgeorgez: Long term, I think we should consider building metrics into the node. I think https://opentelemetry.io/ is worth considering, but not really the next step for us

Mon, Aug 26, 2024, 17:09:15 - deeznnutz: that would be awesome.

Mon, Aug 26, 2024, 17:09:34 - georgezgeorgez: In terms of other metrics, what would help us debug a production issue or a testnet failure?

Mon, Aug 26, 2024, 17:09:45 - georgezgeorgez: We might need different dashboards for prod and dev envs

Mon, Aug 26, 2024, 17:10:00 - deeznnutz: We can add Loki the log processor

Mon, Aug 26, 2024, 17:10:15 - deeznnutz: I've tested that before. it can parse all the logs and you can display them any way you want

Mon, Aug 26, 2024, 17:10:48 - georgezgeorgez: Grafana has something called the LGTM stack https://grafana.com/go/webinar/getting-started-with-grafana-lgtm-stack/

Mon, Aug 26, 2024, 17:10:56 - georgezgeorgez: I'm not familiar with Tempo or Mimir

Mon, Aug 26, 2024, 17:11:29 - deeznnutz: cool - I've never seen that before. I can check it out

Mon, Aug 26, 2024, 17:12:34 - georgezgeorgez: These days, tools are being developed so fast it seems. I think we just go with something, relatively modern, and then if there's a big reason to change, we change. A few years ago, ELK stack was pretty popular, but I think less now. And I think it's a bit overkill. If there is a criteria, we should consider how lightweight the stack is.

Mon, Aug 26, 2024, 17:12:52 - georgezgeorgez: Considering that any resources used for the monitoring stack is taking away from znnd in a single node deploy

Mon, Aug 26, 2024, 17:13:41 - deeznnutz: so the next steps are arm support, Infinity data plugin, create znnd dashboard

Mon, Aug 26, 2024, 17:13:44 - georgezgeorgez: I'm not 100% sure how useful log aggregation will be for a single node

Mon, Aug 26, 2024, 17:14:06 - georgezgeorgez: Considering that all the logs will just be on the box itself

Mon, Aug 26, 2024, 17:14:23 - coinselor: Aren't we making the monitoring stack optional when using the script?

Mon, Aug 26, 2024, 17:14:24 - georgezgeorgez: But if it helps people isolate the logs around a certain timeframe/metric spike it could still be useful

Mon, Aug 26, 2024, 17:14:35 - deeznnutz: we could consider a --send-logs flag

Mon, Aug 26, 2024, 17:15:13 - georgezgeorgez: <@coinselor .chat "Aren't we making the monitoring ..."> Yes optional, but hopefully it's useful enough where most node operators want to run it. So lightweight is better imo

Mon, Aug 26, 2024, 17:15:24 - deeznnutz: <@coinselor .chat "Aren't we making the monitoring ..."> this was one of my questions. I assumed we would add a flag for --grafana to install it separately

Mon, Aug 26, 2024, 17:15:47 - coinselor: I can work on the interactivity of the script. I should be able to look at how the script is installing all the stuff deez is adding and make it interactive so that the user has to choose what to install. Maybe we can make the monitoring stack the (Default) option

Mon, Aug 26, 2024, 17:16:19 - georgezgeorgez: deeznnutz: you are the chair. You run a pillar and nodes. What would actually be useful to you? How can we get feedback about what is important for other operators?

Mon, Aug 26, 2024, 17:16:57 - georgezgeorgez: As chair, you should try and get feedback from users/stakeholders

Mon, Aug 26, 2024, 17:17:13 - georgezgeorgez: Maybe a survey to pillars?

Mon, Aug 26, 2024, 17:17:45 - deeznnutz: ya, makes sense. It would be super helpful to me when trouble shooting stuff if I could get logs and settings when helping someone

Mon, Aug 26, 2024, 17:17:59 - coinselor: I think the survey might be more useful after we have them use the script for the first time, then get their feedback.

Mon, Aug 26, 2024, 17:18:13 - deeznnutz: i always go through a series of questions that are super simple before getting into helping someone.

Mon, Aug 26, 2024, 17:18:35 - georgezgeorgez: nice, that is the basis of the "diagnostics" i talked about

Mon, Aug 26, 2024, 17:18:43 - deeznnutz: but regarding others, I can ask them what would be useful to them as a pillar/operator

Mon, Aug 26, 2024, 17:18:55 - georgezgeorgez: yeah we can do it informally to start

Mon, Aug 26, 2024, 17:19:18 - georgezgeorgez: i just want to make sure we're building stuff with guidance from the actual community

Mon, Aug 26, 2024, 17:19:37 - georgezgeorgez: i mean we're part of the community, but broader feedback

Mon, Aug 26, 2024, 17:19:49 - deeznnutz: what about setting up a producer address like the znn controller does.

Mon, Aug 26, 2024, 17:20:02 - deeznnutz: should we have a --producer flag that setups up a producer address?

Mon, Aug 26, 2024, 17:20:20 - georgezgeorgez: i think that is only necessary for pillars

Mon, Aug 26, 2024, 17:20:34 - georgezgeorgez: so if that is our initial target user then yeah we would need it

Mon, Aug 26, 2024, 17:20:42 - georgezgeorgez: but changing the producer also requires changing it on-chain

Mon, Aug 26, 2024, 17:20:53 - georgezgeorgez: some people might want to re-use an existing producer

Mon, Aug 26, 2024, 17:21:10 - georgezgeorgez: maybe that would be considered a bad practice

Mon, Aug 26, 2024, 17:21:29 - deeznnutz: can a producer address be created with the CLI

Mon, Aug 26, 2024, 17:21:45 - deeznnutz: I've never created one before without using the znn-controller-software

Mon, Aug 26, 2024, 17:22:19 - deeznnutz: <@coinselor .chat "I think the survey might be more..."> maybe we do it before and after

Mon, Aug 26, 2024, 17:22:43 - georgezgeorgez: the producer is just a key-pair. The node configuration has to specify the file to use

Mon, Aug 26, 2024, 17:22:51 - deeznnutz: for example I know shai wants better monitoring tools. Would be interesting to get his feedback

Mon, Aug 26, 2024, 17:23:39 - deeznnutz: right, in the config.json

Mon, Aug 26, 2024, 17:24:01 - coinselor: informally asking before sounds good to brainstorm ideas, but I won't be shocked if someone goes 'a tg bot that alerts me about node going down' and similar requests

Mon, Aug 26, 2024, 17:25:26 - georgezgeorgez: Sometimes a user doesn't exactly know what they want 😅. It's up to us to translate requests into underlying problems and solve those. The surface level suggestion sometimes will be and sometimes won't be the best path

Mon, Aug 26, 2024, 17:25:43 - georgezgeorgez: So another target user could be developers

Mon, Aug 26, 2024, 17:25:55 - georgezgeorgez: I created a "devnet" branch of znnd way back

Mon, Aug 26, 2024, 17:26:11 - georgezgeorgez: And it sets up the producer and config necessary for a single node testnet

Mon, Aug 26, 2024, 17:26:39 - georgezgeorgez: It's baked into znnd. And it means that in order to use it, developers have to rebase their changes on top of the branch

Mon, Aug 26, 2024, 17:26:45 - georgezgeorgez: It would be better if creating a devnet was a separate script

Mon, Aug 26, 2024, 17:26:56 - georgezgeorgez: Not tied to a specific branch of go-zenon

Mon, Aug 26, 2024, 17:27:35 - georgezgeorgez: But I think for Operations, we should focus on node operators first

Mon, Aug 26, 2024, 17:27:40 - deeznnutz: So maybe I can start creating issues in GH for this additional functionality.

Mon, Aug 26, 2024, 17:28:32 - georgezgeorgez: Yeah it's no problem to define more work

Mon, Aug 26, 2024, 17:28:56 - georgezgeorgez: We should have a selection of possible things to do and then work with the users/stakeholders to pick what to do next

Mon, Aug 26, 2024, 17:29:02 - deeznnutz: We are talking about

  • interactive installation menu
  • producer flag
  • testnet flag

Mon, Aug 26, 2024, 17:29:26 - georgezgeorgez: Do you have an idea of how a menu would work?

Mon, Aug 26, 2024, 17:29:33 - deeznnutz: in addition to the things mentioned above to integrate znnd monitoring

Mon, Aug 26, 2024, 17:29:42 - georgezgeorgez: I think doing it in bash wouldn't be so pretty

Mon, Aug 26, 2024, 17:29:49 - deeznnutz: <@georgezgeorgez .chat "Do you have an idea of how a men..."> I know how it won't work... lol

Mon, Aug 26, 2024, 17:30:08 - deeznnutz: I tried it and could not get one working with an install command with curl.

Mon, Aug 26, 2024, 17:30:28 - deeznnutz: maybe I just gave up too early

Mon, Aug 26, 2024, 17:30:34 - georgezgeorgez: https://github.com/charmbracelet/bubbletea If we do go in the direction of TUIs

Mon, Aug 26, 2024, 17:31:11 - deeznnutz: that would be awesome

Mon, Aug 26, 2024, 17:31:19 - georgezgeorgez: deeznnutz: but again, probably not the near term focus. What do you think we should try and have done before next meeting?

Mon, Aug 26, 2024, 17:31:59 - deeznnutz: my goal is to get the new datasource integrated and a custom znnd dashboard working

Mon, Aug 26, 2024, 17:32:06 - deeznnutz: that what I can work on.

Mon, Aug 26, 2024, 17:32:21 - georgezgeorgez: Cool that's what I was thinking too. Maybe at the very least, wireframe JSON for the dashboard? And some poc of the infinity plugin for at least one of the graphs

Mon, Aug 26, 2024, 17:32:25 - deeznnutz: and pull in your changes

Mon, Aug 26, 2024, 17:32:55 - deeznnutz: <@georgezgeorgez .chat "Cool that's what I was thinking ..."> yes that is doable

Mon, Aug 26, 2024, 17:33:52 - deeznnutz: The TUI framework would be super cool.

Mon, Aug 26, 2024, 17:33:53 - georgezgeorgez: <@deeznnutz .chat "and pull in your changes"> If we want to actually test it live, we would need to run an arm64 server

Mon, Aug 26, 2024, 17:34:12 - deeznnutz: <@georgezgeorgez .chat "If we want to actually test it l..."> DO does not have them. So I can test on another platform

Mon, Aug 26, 2024, 17:34:14 - georgezgeorgez: Long term, I think it would make sense for us to have some test scripts that interact with cloud provider APIs to spin up nodes etc

Mon, Aug 26, 2024, 17:34:35 - georgezgeorgez: Run some tests, spit out data, and then tear it down

Mon, Aug 26, 2024, 17:34:52 - georgezgeorgez: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

Mon, Aug 26, 2024, 17:35:04 - coinselor: I can test the arm64 changes

Mon, Aug 26, 2024, 17:35:39 - deeznnutz: I need to add arm support for the grafana install too

Mon, Aug 26, 2024, 17:36:00 - deeznnutz: maybe we can all take a look at the TUI framework.

Mon, Aug 26, 2024, 17:36:22 - deeznnutz: between that and everything else going on I think this is doable in the next 2 weeks

Mon, Aug 26, 2024, 17:36:40 - georgezgeorgez: A lot of my devnet branch could be carved out. But let's get a dashboard out first, before we make things pretty

Mon, Aug 26, 2024, 17:36:58 - deeznnutz: cool. sounds like a plan

Mon, Aug 26, 2024, 17:37:04 - georgezgeorgez: I can help with the Infinity plugin and grafana dashboard JSON and whatever else really

Mon, Aug 26, 2024, 17:37:35 - deeznnutz: maybe you can take on the dashboard after I get the plugin installed and setup

Mon, Aug 26, 2024, 17:37:36 - georgezgeorgez: And yeah maybe before the next meeting, we can talk with other pillar operators

Mon, Aug 26, 2024, 17:38:02 - deeznnutz: I have time this week. I'm traveling T-TH next week.

Mon, Aug 26, 2024, 17:38:31 - georgezgeorgez: When do you think we should meet next?

Mon, Aug 26, 2024, 17:38:59 - deeznnutz: Sept 9 @ 6PM EST? does that work?

Mon, Aug 26, 2024, 17:39:32 - georgezgeorgez: Should be good

Mon, Aug 26, 2024, 17:39:41 - georgezgeorgez: coinselor hbu?

Mon, Aug 26, 2024, 17:39:51 - coinselor: Ye that works

Mon, Aug 26, 2024, 17:40:08 - deeznnutz: cool. sounds like a plan. thx everyone!!

Mon, Aug 26, 2024, 17:40:21 - georgezgeorgez: Anything else you want to go over? Or call it for today?

Mon, Aug 26, 2024, 17:40:47 - deeznnutz: I'm good. did you see my post on dynamic fusing?

Mon, Aug 26, 2024, 17:40:54 - deeznnutz: am I retarded?

Mon, Aug 26, 2024, 17:41:11 - georgezgeorgez: Not sure if either question is within scope of the SIG

Mon, Aug 26, 2024, 17:41:16 - georgezgeorgez: haha

Mon, Aug 26, 2024, 17:41:19 - deeznnutz: lol

Mon, Aug 26, 2024, 17:41:26 - deeznnutz: ya, we can chat about that elsewhere

Mon, Aug 26, 2024, 17:41:44 - georgezgeorgez: Thank you everyone. Thank you deeznnutz for facilitating as chair
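
For reference when opening the follow-up issues, here is a hedged sketch of how the optional flags floated in the meeting (`--grafana`, `--producer`, `--send-logs`) might be parsed without an interactive menu. Only `--help` is confirmed to exist in the script. The other flags were proposals, and the function and variable names here are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: only --help exists per the minutes; the other
# flags were proposals, and these variable names are illustrative.

# Set one 0/1 variable per optional feature based on the arguments.
parse_flags() {
    INSTALL_GRAFANA=0
    SETUP_PRODUCER=0
    SEND_LOGS=0
    for arg in "$@"; do
        case "$arg" in
            --grafana)   INSTALL_GRAFANA=1 ;;
            --producer)  SETUP_PRODUCER=1 ;;
            --send-logs) SEND_LOGS=1 ;;
            --help)      echo "usage: [--grafana] [--producer] [--send-logs]"; return 0 ;;
            *)           echo "unknown option: $arg" >&2; return 1 ;;
        esac
    done
}
```

The installer could call `parse_flags "$@" || exit 1` and then branch on each variable, which keeps the monitoring stack optional as discussed.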