cross-posted from: https://lemmy.world/post/28808772

Finally released an alpha build for the PeerTube recommendation algorithm!
Basic UI is complete. If you want to try it out, the link is here:
👉 https://github.com/solidheron/peertube_recomendation_algorythm

New features since the last build:

  • Sort videos by time-engagement similarity.
  • Sort videos by like similarity.
  • Like-similarity cosine values are now displayed.
  • Basic information shown for recommended videos (title, account, and channel names).
  • 404 check for generated instance links, so you don’t get stuck clicking into dead videos and you know which instance actually hosts the video.
  • De-ranking for previously seen videos (simply a 0.5x multiplier on time and like similarity).
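
A rough sketch of the scoring idea in Python (this is not the extension's actual code; the function names, the dict-based keyword vectors, and the sample keywords are assumptions for illustration). It computes cosine similarity between a user vector and a video vector, then applies the 0.5x multiplier to videos that were already seen:

```python
import math

SEEN_MULTIPLIER = 0.5  # from the post: previously seen videos get a 0.5x multiplier


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def ranked_score(user_vec: dict[str, float], video_vec: dict[str, float], seen: bool) -> float:
    """Similarity score, halved when the video was previously seen."""
    score = cosine_similarity(user_vec, video_vec)
    return score * SEEN_MULTIPLIER if seen else score


# Hypothetical vectors: the video points in the same direction as the user.
user = {"linux": 2.0, "kernel": 1.0}
video = {"linux": 1.0, "kernel": 0.5}
print(round(ranked_score(user, video, seen=False), 3))  # 1.0
print(round(ranked_score(user, video, seen=True), 3))   # 0.5
```

The same function would apply to either the like vector or the time-engagement vector; only the weights stored in the dicts would differ.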

Features from previous builds:

  • Ability to input multiple instance domain names (DNs) and generate playable video links.
  • Limit of 5 recommendations per channel to avoid floods (e.g., during testing, The Linux Experiment would otherwise dominate; this limit is more of a failsafe than a feature).
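
The per-channel limit can be pictured roughly like this (a minimal Python sketch, not the extension's code; the `channel` key and the list-of-dicts shape are assumptions). It walks an already ranked list and keeps at most 5 videos from any one channel:

```python
from collections import defaultdict

MAX_PER_CHANNEL = 5  # limit stated in the post


def cap_per_channel(ranked_videos: list[dict], limit: int = MAX_PER_CHANNEL) -> list[dict]:
    """Keep at most `limit` videos per channel, preserving rank order."""
    kept = []
    counts = defaultdict(int)
    for video in ranked_videos:
        channel = video["channel"]
        if counts[channel] < limit:
            kept.append(video)
            counts[channel] += 1
    return kept


# Hypothetical ranked list: 7 videos from one channel, then 2 from another.
ranked = [{"title": f"a{i}", "channel": "A"} for i in range(7)]
ranked += [{"title": f"b{i}", "channel": "B"} for i in range(2)]
print(len(cap_per_channel(ranked)))  # 7 (5 from A, 2 from B)
```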

Personal thoughts:
I still think cosine similarity beats chronological algorithms.
This algorithm also synergizes with other algorithms—it’s great for finding videos that appear next to or below what you’re currently watching.

You can also revisit videos you previously liked to help strengthen your like similarity vectors.


Moving forward: basic design philosophies and current issues

There’s an issue I’m calling the “Linux pipeline.”
Basically, Linux-related videos tend to dominate PeerTube’s well-produced content.
Since the algorithm relies on English words in descriptions, titles, and tags, Linux videos (which sometimes have fewer general keywords) end up being more “orthogonal” to typical user vectors, so they rank lower.
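
To make “orthogonal” concrete, here is a small Python sketch (the tokenizer, the helper names, and the sample titles and tags are all made up for illustration; the extension may weight words differently). When a video's words never overlap with the user's words, the dot product is zero and so is the cosine similarity:

```python
import math
import re
from collections import Counter


def keyword_vector(*texts: str) -> Counter:
    """Count lowercase word occurrences across title, description, and tags."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return Counter(words)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


user = keyword_vector("cooking pasta recipe", "easy home cooking")
linux_video = keyword_vector("Arch btw", "pacman systemd wayland")
cooking_video = keyword_vector("Quick pasta recipe", "cooking at home")

print(cosine(user, linux_video))        # 0.0: no shared words, so the vectors are orthogonal
print(cosine(user, cooking_video) > 0)  # True: several words overlap
```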

Another challenge:
It’s really hard to properly combine like cosine similarity and time engagement cosine similarity.
You could add them, but it doesn’t fully make sense:

  • High like similarity + high time engagement similarity = you probably like and will watch the video longer.
  • But short videos can be liked even if they contribute almost nothing to time engagement (because time engagement is based on percentage watched × video length).
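
A quick numeric illustration in Python of why the two signals don't line up (the function name and the use of seconds as the unit are assumptions; the formula is the one stated above, percentage watched × video length):

```python
def time_engagement(fraction_watched: float, video_length_s: float) -> float:
    """Engagement in seconds: fraction of the video watched times its length."""
    return fraction_watched * video_length_s


short_liked = time_engagement(1.0, 30)   # a fully watched (and liked) 30-second clip
long_half = time_engagement(0.5, 1800)   # half of a 30-minute video
print(short_liked, long_half)  # 30.0 900.0
```

The fully watched short clip contributes 30 times less engagement than the half-watched long video, even though it may be the one that got the like.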

If I combined them, it would basically enter machine learning territory:
You’d have to adjust proportions dynamically based on user behavior.
Since I want this algorithm scoped to one person only (no data sharing yet), that level of ML is out of scope for now.

(Sharing data across devices could come later—Brave browser has sync features, and PeerTube watch history syncing could be possible.)


Summary:
Most of the data structure is settling into place.
Future updates will probably focus on expanding the data structure and making small improvements.

  • iso@lemy.lol (7 upvotes, edited, 6 months ago)

    I think open discovery algorithms are the way. We are against algos but sorting by like similarity would be beneficial.

    What are you guys thinking? @dessalines@lemmy.ml @nutomic@lemmy.ml Are you optimistic about this or fuck any algorithms?

    • Nutomic@lemmy.ml (11 upvotes, 6 months ago)

      Algorithms are definitely needed to discover good content. There are some good videos on PeerTube, but it's very difficult to find them due to all the low-effort spam. Lemmy also had different algorithms from the beginning and no one ever complained about them.

      The problem with the algorithms used by Reddit, Facebook, etc. is that they are completely opaque and include factors which don't benefit the user, such as “engagement” or advertising. As long as Fediverse algorithms are focused on benefiting the user and are transparent, there is nothing wrong with them.

      • Cattail@lemmy.world (OP; 3 upvotes, 6 months ago)

        I want to encourage fedizens and Internet users to collect their own data and run their own algorithms, since that has worked well for corporations in general, while continuing to use their services. It seems people enjoy algorithms based on their own data.

        It’s neat that the fediverse has a general principle of not collecting user data, so if more people used fediverse instances more often, less data would go to corporations. This browser extension is an outline of how collecting your own data can affect your experience with the fediverse. There’s so much you can do with your own data and data from the API.