Operations SIG 26 Aug 2024


Agenda

What: Meeting to Discuss Improving Node Operations as part of the HC1: Operations SIG

When: 26 Aug 2024 @ 6PM EST

Where: https://matrix.to/#/#sig-operations:hc1.chat

Chair: 0x3639

Agenda:

  1. Discuss follow-up items from previous meeting
  2. Document action items
  3. Establish next meeting

If you want to attend please respond (or DM) with your full matrix username and I will invite you to the group. No FUD, anger or BS allowed.

Pre-meeting Notes

0x3639

  • Added `grafana.sh` (https://github.com/go-zenon/go/blob/main/grafana.sh). This automates the installation of Grafana, node_exporter, and Prometheus. It creates a default Prometheus data source, scrapes the node_exporter endpoint, and installs a node_exporter dashboard. Tested on amd64; arm64 support still needs to be added.
  • Started to investigate a custom dashboard for znnd. I previously created one for Docker that used the JSON API data source for Grafana. However, that plugin is now in maintenance mode and no new features will be added. Grafana recommends using the Infinity data source plugin instead.
  • I started to investigate the Infinity data source plugin. It will be used to scrape the API endpoints to report `syncStatus` and other important metrics.
  • Next we can consider installing Loki to manage log files. We can discuss at the meeting.

George

Coinselor

  • Made ASCII Art more readable at lower resolutions.
  • Added --help flag https://github.com/go-zenon/go/pull/8
  • I can test arm64 support, will be spawning an arm VPS for the Supernova testnet.

Minutes

Mon, Aug 26, 2024, 17:03:11 - deeznnutz: Thx everyone for contributing to the go-zenon bash script. We are making good progress.

Mon, Aug 26, 2024, 17:03:21 - deeznnutz: I merged in @coinselor's PR #8 to improve the ASCII art and add a --help flag.

Mon, Aug 26, 2024, 17:03:34 - deeznnutz: those changes were pretty straightforward

Mon, Aug 26, 2024, 17:03:42 - deeznnutz: George submitted the PR for arm64 support. I have not tested it yet. Once we test it we can pull in that change. It's pretty simple. He submitted an issue to make sure the script checks for apt and systemd. Should we clarify that as a requirement or have the script check for the proper operating system and systemd?

Mon, Aug 26, 2024, 17:04:05 - georgezgeorgez: I think it's fine just to document it for now.

Mon, Aug 26, 2024, 17:04:15 - georgezgeorgez: I think it's okay for us to do 1 deployment target really well first.

Mon, Aug 26, 2024, 17:04:45 - georgezgeorgez: The people who need the most support will probably be choosing ubuntu/deb as their recommended OS.

Mon, Aug 26, 2024, 17:04:51 - deeznnutz: ya, makes sense. Should we check in the script and halt it if apt and systemd are not present?

Mon, Aug 26, 2024, 17:05:25 - georgezgeorgez: We could do that, but not a priority.

Mon, Aug 26, 2024, 17:05:46 - deeznnutz: OK - I can add that as a todo and we can deal with it later.
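
The apt/systemd check discussed above could look roughly like the following. This is a minimal sketch in Python for illustration only (the installer itself is a bash script); the function name `missing_prerequisites` is hypothetical, and testing for the `/run/systemd/system` directory is one common way to detect a running systemd, not necessarily what the script will use.

```python
import shutil
from pathlib import Path


def missing_prerequisites(which=shutil.which, systemd_dir=Path("/run/systemd/system")):
    """Return the names of missing prerequisites; an empty list means the host looks supported."""
    missing = []
    # apt-get on PATH is used here as the signal for a Debian/Ubuntu style host.
    if which("apt-get") is None:
        missing.append("apt")
    # This directory exists only when systemd is the running init system.
    if not systemd_dir.is_dir():
        missing.append("systemd")
    return missing
```

A caller would halt the install when the returned list is non-empty and print the missing items, which also doubles as the documentation of the requirement.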

Mon, Aug 26, 2024, 17:05:47 - georgezgeorgez: We should try and get someone to use this script in the wild asap and get information about their nodes via the monitoring

Mon, Aug 26, 2024, 17:06:03 - deeznnutz: I set up a standalone script that automates the installation of Grafana, node_exporter & Prometheus. It creates a default Prometheus datasource, scrapes the node_exporter endpoint, and installs a default node_exporter dashboard. It currently only works on amd64.

Mon, Aug 26, 2024, 17:06:21 - deeznnutz: TODO

  • We need to expand functionality to arm64
  • Add a custom dashboard for znnd. This will require installing the Infinity data plugin and adding a new datasource (you can add an API endpoint as a datasource and it scrapes the API at x interval).
  • Potential znnd metrics to show:
    • Sync status
    • currentHeight
    • targetHeight
    • version
    • commit
    • numPeers
    • stats.osInfo
  • What else should we include?
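
To make the polling idea concrete, here is a rough Python sketch of the request a data source would send on each interval and how the listed fields could be pulled out of the reply. It assumes znnd answers JSON-RPC 2.0 calls and that a method named `stats.syncInfo` returns `currentHeight` and `targetHeight`; the method name, the sample response, and the derived `blocksBehind` value are assumptions for illustration, not confirmed API details.

```python
import json


def rpc_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for the given method name."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params or []}
    )


def sync_metrics(response_text):
    """Extract dashboard fields from a sync-info style response body."""
    result = json.loads(response_text)["result"]
    current = result["currentHeight"]
    target = result["targetHeight"]
    return {
        "currentHeight": current,
        "targetHeight": target,
        # Derived value (assumption): how far the node is from the target height.
        "blocksBehind": max(target - current, 0),
    }


# Hypothetical response body, shaped after the field names listed above.
sample = '{"jsonrpc": "2.0", "id": 1, "result": {"currentHeight": 990, "targetHeight": 1000}}'
print(rpc_request("stats.syncInfo"))
print(sync_metrics(sample))
```

In Grafana the same request body would be configured in the data source query rather than in code; the sketch only shows which fields a panel would read.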

Mon, Aug 26, 2024, 17:06:35 - georgezgeorgez: What is the infinity data plugin?

Mon, Aug 26, 2024, 17:06:45 - deeznnutz: it's a plugin that allows curl calls

Mon, Aug 26, 2024, 17:07:03 - deeznnutz: it basically runs them on a schedule and then you can display the data in a dashboard

Mon, Aug 26, 2024, 17:07:22 - georgezgeorgez: gotcha. That might be the fastest way

Mon, Aug 26, 2024, 17:07:34 - georgezgeorgez: There could be other relatively quick methods like parsing logs

Mon, Aug 26, 2024, 17:07:54 - deeznnutz: previously I used JSON API and it worked great. but that plugin is no longer under development

Mon, Aug 26, 2024, 17:08:04 - coinselor: I think syrius shows quite a few znnd metrics, we could use that as reference

Mon, Aug 26, 2024, 17:08:49 - georgezgeorgez: Long term, I think we should consider building metrics into the node. I think https://opentelemetry.io/ is worth considering, but not really the next step for us

Mon, Aug 26, 2024, 17:09:15 - deeznnutz: that would be awesome.

Mon, Aug 26, 2024, 17:09:34 - georgezgeorgez: In terms of other metrics, what would help us debug a production issue or a testnet failure?

Mon, Aug 26, 2024, 17:09:45 - georgezgeorgez: We might need different dashboards for prod and dev envs

Mon, Aug 26, 2024, 17:10:00 - deeznnutz: We can add Loki the log processor

Mon, Aug 26, 2024, 17:10:15 - deeznnutz: I've tested that before. it can parse all the logs and you can display them any way you want

Mon, Aug 26, 2024, 17:10:48 - georgezgeorgez: Grafana has something called the LGTM stack https://grafana.com/go/webinar/getting-started-with-grafana-lgtm-stack/

Mon, Aug 26, 2024, 17:10:56 - georgezgeorgez: I'm not familiar with Tempo or Mimir

Mon, Aug 26, 2024, 17:11:29 - deeznnutz: cool - I've never seen that before. I can check it out

Mon, Aug 26, 2024, 17:12:34 - georgezgeorgez: These days, tools are being developed so fast it seems. I think we just go with something, relatively modern, and then if there's a big reason to change, we change. A few years ago, ELK stack was pretty popular, but I think less now. And I think it's a bit overkill. If there is a criteria, we should consider how lightweight the stack is.

Mon, Aug 26, 2024, 17:12:52 - georgezgeorgez: Considering that any resources used for the monitoring stack is taking away from znnd in a single node deploy

Mon, Aug 26, 2024, 17:13:41 - deeznnutz: so the next steps are arm support, Infinity data plugin, create znnd dashboard

Mon, Aug 26, 2024, 17:13:44 - georgezgeorgez: I'm not 100% sure how useful log aggregation will be for a single node

Mon, Aug 26, 2024, 17:14:06 - georgezgeorgez: Considering that all the logs will just be on the box itself

Mon, Aug 26, 2024, 17:14:23 - coinselor: Aren't we making the monitoring stack optional when using the script?

Mon, Aug 26, 2024, 17:14:24 - georgezgeorgez: But if it helps people isolate the logs around a certain timeframe/metric spike it could still be useful

Mon, Aug 26, 2024, 17:14:35 - deeznnutz: we could consider a --send-logs flag

Mon, Aug 26, 2024, 17:15:13 - georgezgeorgez: <@coinselor "Aren't we making the monitoring ..."> Yes optional, but hopefully it's useful enough where most node operators want to run it. So lightweight is better imo

Mon, Aug 26, 2024, 17:15:24 - deeznnutz: <@coinselor "Aren't we making the monitoring ..."> this was one of my questions. I assumed we would add a flag for --grafana to install it separately

Mon, Aug 26, 2024, 17:15:47 - coinselor: I can work on the interactivity of the script. I should be able to look at how the script is installing all the stuff deez is adding and make it interactive so that the user has to choose what to install. Maybe we can make the monitoring stack the (Default) option
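
One way to read the "(Default) option" idea: pressing Enter accepts the default, and only an explicit "no" skips the monitoring stack. The sketch below is illustrative Python, not the script's code; the helper name `confirm` and the accepted answer strings are assumptions.

```python
def confirm(answer, default=True):
    """Interpret a yes/no answer; an empty answer falls back to the default."""
    answer = answer.strip().lower()
    if not answer:
        return default
    if answer in ("y", "yes"):
        return True
    if answer in ("n", "no"):
        return False
    raise ValueError(f"expected yes or no, got: {answer!r}")


# Pressing Enter at a prompt such as "Install monitoring stack? [Y/n]" keeps the default.
print(confirm(""))
print(confirm("n"))
```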

Mon, Aug 26, 2024, 17:16:19 - georgezgeorgez: deeznnutz: you are the chair. You run a pillar and nodes. What would actually be useful to you? How can we get feedback about what is important for other operators?

Mon, Aug 26, 2024, 17:16:57 - georgezgeorgez: As chair, you should try and get feedback from users/stakeholders

Mon, Aug 26, 2024, 17:17:13 - georgezgeorgez: Maybe a survey to pillars?

Mon, Aug 26, 2024, 17:17:45 - deeznnutz: ya, makes sense. It would be super helpful to me when troubleshooting stuff if I could get logs and settings when helping someone

Mon, Aug 26, 2024, 17:17:59 - coinselor: I think the survey might be more useful after we have them use the script for the first time, then get their feedback.

Mon, Aug 26, 2024, 17:18:13 - deeznnutz: i always go through a series of questions that are super simple before getting into helping someone.

Mon, Aug 26, 2024, 17:18:35 - georgezgeorgez: nice, that is the basis of the "diagnostics" i talked about

Mon, Aug 26, 2024, 17:18:43 - deeznnutz: but regarding others, I can ask them what would be useful to them as a pillar/operator

Mon, Aug 26, 2024, 17:18:55 - georgezgeorgez: yeah we can do it informally to start

Mon, Aug 26, 2024, 17:19:18 - georgezgeorgez: i just want to make sure we're building stuff with guidance from the actual community

Mon, Aug 26, 2024, 17:19:37 - georgezgeorgez: i mean we're part of the community, but broader feedback

Mon, Aug 26, 2024, 17:19:49 - deeznnutz: what about setting up a producer address like the znn controller does.

Mon, Aug 26, 2024, 17:20:02 - deeznnutz: should we have a --producer flag that sets up a producer address?

Mon, Aug 26, 2024, 17:20:20 - georgezgeorgez: i think that is only necessary for pillars

Mon, Aug 26, 2024, 17:20:34 - georgezgeorgez: so if that is our initial target user then yeah we would need it

Mon, Aug 26, 2024, 17:20:42 - georgezgeorgez: but changing the producer also requires changing it on-chain

Mon, Aug 26, 2024, 17:20:53 - georgezgeorgez: some people might want to re-use an existing producer

Mon, Aug 26, 2024, 17:21:10 - georgezgeorgez: maybe that would be considered a bad practice

Mon, Aug 26, 2024, 17:21:29 - deeznnutz: can a producer address be created with the CLI

Mon, Aug 26, 2024, 17:21:45 - deeznnutz: I've never created one before without using the znn-controller-software

Mon, Aug 26, 2024, 17:22:19 - deeznnutz: <@coinselor "I think the survey might be more..."> maybe we do it before and after

Mon, Aug 26, 2024, 17:22:43 - georgezgeorgez: the producer is just a key-pair. The node configuration has to specify the file to use

Mon, Aug 26, 2024, 17:22:51 - deeznnutz: for example I know shai wants better monitoring tools. Would be interesting to get his feedback

Mon, Aug 26, 2024, 17:23:39 - deeznnutz: right, in the config.json
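
As a rough illustration of "the node configuration has to specify the file to use", the Python sketch below adds a producer section to a config dictionary before it is written to config.json. The key names (`Producer`, `Address`, `Index`, `KeyFilePath`, `Password`) are assumptions that should be verified against go-zenon, and the paths and values are placeholders.

```python
import json


def with_producer(config, key_file_path, address):
    """Return a copy of the config with a producer section pointing at a key file."""
    updated = dict(config)
    # Assumed key names; check them against the go-zenon config definition.
    updated["Producer"] = {
        "Address": address,
        "Index": 0,
        "KeyFilePath": key_file_path,
        "Password": "<set-locally>",
    }
    return updated


base = {"Name": "my-node"}  # placeholder for an existing config
print(json.dumps(with_producer(base, "/path/to/producer-keyfile", "<producer-address>"), indent=4))
```

As noted in the discussion, writing this section only covers the node side; changing the producer also requires changing it on-chain.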

Mon, Aug 26, 2024, 17:24:01 - coinselor: informally asking before sounds good to brainstorm ideas, but I won't be shocked if someone goes 'a tg bot that alerts me about node going down' and similar requests

Mon, Aug 26, 2024, 17:25:26 - georgezgeorgez: Sometimes a user doesn't exactly know what they want 😅. It's up to us to translate requests into underlying problems and solve those. The surface level suggestion sometimes will be and sometimes won't be the best path

Mon, Aug 26, 2024, 17:25:43 - georgezgeorgez: So another target user could be developers

Mon, Aug 26, 2024, 17:25:55 - georgezgeorgez: I created a "devnet" branch of znnd way back

Mon, Aug 26, 2024, 17:26:11 - georgezgeorgez: And it sets up the producer and config necessary for a single node testnet

Mon, Aug 26, 2024, 17:26:39 - georgezgeorgez: It's baked into znnd. And it means that in order to use it, developers have to rebase their changes on top of the branch

Mon, Aug 26, 2024, 17:26:45 - georgezgeorgez: It would be better if creating a devnet was a separate script

Mon, Aug 26, 2024, 17:26:56 - georgezgeorgez: Not tied to a specific branch of go-zenon

Mon, Aug 26, 2024, 17:27:35 - georgezgeorgez: But I think for Operations, we should focus on node operators first

Mon, Aug 26, 2024, 17:27:40 - deeznnutz: So maybe I can start creating issues in GH for this additional functionality.

Mon, Aug 26, 2024, 17:28:32 - georgezgeorgez: Yeah it's no problem to define more work

Mon, Aug 26, 2024, 17:28:56 - georgezgeorgez: We should have a selection of possible things to do and then work with the users/stakeholders to pick what to do next

Mon, Aug 26, 2024, 17:29:02 - deeznnutz: We are talking about

  • interactive installation menu
  • producer flag
  • testnet flag
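
The flags floated so far (`--grafana`, `--send-logs`, `--producer`, and a testnet option) could fit together as sketched below. This uses Python's argparse purely to show the option surface; the installer is a bash script, none of these flags except `--help` exist yet, and the program name and help texts are hypothetical.

```python
import argparse


def build_parser():
    """Option surface for the proposed flags; argparse adds --help automatically."""
    parser = argparse.ArgumentParser(prog="installer", description="Install and configure a node")
    parser.add_argument("--grafana", action="store_true", help="also install the monitoring stack")
    parser.add_argument("--send-logs", action="store_true", help="opt in to sharing logs")
    parser.add_argument("--producer", action="store_true", help="set up a producer key file")
    parser.add_argument("--testnet", action="store_true", help="configure the node for a testnet")
    return parser


args = build_parser().parse_args(["--grafana", "--testnet"])
print(args)
```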

Mon, Aug 26, 2024, 17:29:26 - georgezgeorgez: Do you have an idea of how a menu would work?

Mon, Aug 26, 2024, 17:29:33 - deeznnutz: in addition to the things mentioned above to integrate znnd monitoring

Mon, Aug 26, 2024, 17:29:42 - georgezgeorgez: I think doing it in bash wouldn't be so pretty

Mon, Aug 26, 2024, 17:29:49 - deeznnutz: <@georgezgeorgez "Do you have an idea of how a men..."> I know how it won't work... lol

Mon, Aug 26, 2024, 17:30:08 - deeznnutz: I tried it and could not get one working with an install command with curl.

Mon, Aug 26, 2024, 17:30:28 - deeznnutz: maybe I just gave up too early

Mon, Aug 26, 2024, 17:30:34 - georgezgeorgez: https://github.com/charmbracelet/bubbletea If we do go in the direction of TUIs

Mon, Aug 26, 2024, 17:31:11 - deeznnutz: that would be awesome

Mon, Aug 26, 2024, 17:31:19 - georgezgeorgez: deeznnutz: but again, probably not the near term focus. What do you think we should try and have done before next meeting?

Mon, Aug 26, 2024, 17:31:59 - deeznnutz: my goal is to get the new datasource integrated and a custom znnd dashboard working

Mon, Aug 26, 2024, 17:32:06 - deeznnutz: that's what I can work on.

Mon, Aug 26, 2024, 17:32:21 - georgezgeorgez: Cool that's what I was thinking too. Maybe at the very least, wireframe JSON for the dashboard? And some poc of the infinity plugin for at least one of the graphs

Mon, Aug 26, 2024, 17:32:25 - deeznnutz: and pull in your changes

Mon, Aug 26, 2024, 17:32:55 - deeznnutz: <@georgezgeorgez "Cool that's what I was thinking ..."> yes that is doable

Mon, Aug 26, 2024, 17:33:52 - deeznnutz: The TUI framework would be super cool.

Mon, Aug 26, 2024, 17:33:53 - georgezgeorgez: <@deeznnutz "and pull in your changes"> If we want to actually test it live, we would need to run an arm64 server

Mon, Aug 26, 2024, 17:34:12 - deeznnutz: <@georgezgeorgez "If we want to actually test it l..."> DO does not have them. So I can test on another platform

Mon, Aug 26, 2024, 17:34:14 - georgezgeorgez: Long term, I think it would make sense for us to have some test scripts that interact with cloud provider APIs to spin up nodes etc

Mon, Aug 26, 2024, 17:34:35 - georgezgeorgez: Run some tests, spit out data, and then tear it down

Mon, Aug 26, 2024, 17:34:52 - georgezgeorgez: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
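
The spin up, test, tear down cycle described above can be sketched as a context manager that always deletes the server, even when a check fails. The Python below is only a shape for the idea: `FakeProvider` is a stand-in so the example runs without cloud credentials, and its methods are hypothetical, not boto3 calls.

```python
from contextlib import contextmanager


class FakeProvider:
    """Stand-in for a cloud SDK client; it records servers instead of creating them."""

    def __init__(self):
        self.running = set()
        self.next_id = 1

    def create_server(self, arch):
        server_id = f"{arch}-{self.next_id}"
        self.next_id += 1
        self.running.add(server_id)
        return server_id

    def delete_server(self, server_id):
        self.running.discard(server_id)


@contextmanager
def ephemeral_node(provider, arch):
    """Create a server, hand it to the caller, and delete it no matter what happens."""
    server_id = provider.create_server(arch)
    try:
        yield server_id
    finally:
        provider.delete_server(server_id)


def run_checks(server_id):
    """Placeholder for running the install script and collecting results."""
    return {"server": server_id, "installed": True}


provider = FakeProvider()
with ephemeral_node(provider, "arm64") as node:
    report = run_checks(node)
print(report, provider.running)
```

Swapping `FakeProvider` for a thin wrapper around a real provider SDK would keep the teardown guarantee in one place.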

Mon, Aug 26, 2024, 17:35:04 - coinselor: I can test the arm64 changes

Mon, Aug 26, 2024, 17:35:39 - deeznnutz: I need to add arm support for the grafana install too
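
Adding arm support to the Grafana install mostly comes down to picking the right download per architecture. A minimal Python sketch of that mapping follows; the installer would do the equivalent in bash with `uname -m`, and the helper name `release_arch` is hypothetical. Prometheus and node_exporter release archives use the `amd64` and `arm64` labels in their file names.

```python
import platform

# Kernel-reported machine names mapped to the labels used in release archive names.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def release_arch(machine=None):
    """Map a machine name (default: this host's) to a release architecture label."""
    machine = (machine or platform.machine()).lower()
    try:
        return ARCH_ALIASES[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None


print(release_arch("aarch64"))
```

Failing loudly on an unknown architecture keeps the script from downloading a binary that cannot run.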

Mon, Aug 26, 2024, 17:36:00 - deeznnutz: maybe we can all take a look at the TUI framework.

Mon, Aug 26, 2024, 17:36:22 - deeznnutz: between that and everything else going on I think this is doable in the next 2 weeks

Mon, Aug 26, 2024, 17:36:40 - georgezgeorgez: A lot of my devnet branch could be carved out. But let's get a dashboard out first, before we make things pretty

Mon, Aug 26, 2024, 17:36:58 - deeznnutz: cool. sounds like a plan

Mon, Aug 26, 2024, 17:37:04 - georgezgeorgez: I can help with the Infinity plugin and grafana dashboard JSON and whatever else really

Mon, Aug 26, 2024, 17:37:35 - deeznnutz: maybe you can take on the dashboard after I get the plugin installed and setup

Mon, Aug 26, 2024, 17:37:36 - georgezgeorgez: And yeah maybe before the next meeting, we can talk with other pillar operators

Mon, Aug 26, 2024, 17:38:02 - deeznnutz: I have time this week. I'm traveling T-TH next week.

Mon, Aug 26, 2024, 17:38:31 - georgezgeorgez: When do you think we should meet next?

Mon, Aug 26, 2024, 17:38:59 - deeznnutz: Sept 9 @ 6PM EST? does that work?

Mon, Aug 26, 2024, 17:39:32 - georgezgeorgez: Should be good

Mon, Aug 26, 2024, 17:39:41 - georgezgeorgez: coinselor hbu?

Mon, Aug 26, 2024, 17:39:51 - coinselor: Ye that works

Mon, Aug 26, 2024, 17:40:08 - deeznnutz: cool. sounds like a plan. thx everyone!!

Mon, Aug 26, 2024, 17:40:21 - georgezgeorgez: Anything else you want to go over? Or call it for today?

Mon, Aug 26, 2024, 17:40:47 - deeznnutz: I'm good. did you see my post on dynamic fusing?

Mon, Aug 26, 2024, 17:40:54 - deeznnutz: am I retarded?

Mon, Aug 26, 2024, 17:41:11 - georgezgeorgez: Not sure if either question is within scope of the SIG

Mon, Aug 26, 2024, 17:41:16 - georgezgeorgez: haha

Mon, Aug 26, 2024, 17:41:19 - deeznnutz: lol

Mon, Aug 26, 2024, 17:41:26 - deeznnutz: ya, we can chat about that elsewhere

Mon, Aug 26, 2024, 17:41:44 - georgezgeorgez: Thank you everyone. Thank you deeznnutz for facilitating as chair