• Barbarian@sh.itjust.works
    4 months ago

    I get the sentiment, but it’s a bad example. Transformer models don’t recognize images in any useful way that could be fed to other systems. They also don’t have any capability of actual understanding or context. Heavily simplifying here, tokenisation of inputs allows them to group clusters of letters together into tokens, so when they receive tokens they can spit out whatever the training data says they should.

    The only actual things that are improving greatly here which could be used in different systems are natural language processing, natural language output and visual output.

    EDIT: Crossed out stuff that is wrong.

    • MrConfusion@lemmy.world
      4 months ago

      Well, this is simply incorrect. And confidently incorrect at that.

      Vision transformers (ViTs) are an important branch of computer vision models that apply transformers to image analysis and detection tasks. They perform very well. The main idea is the same: by tokenizing the input image into smaller chunks, you can apply the same attention mechanism as in NLP transformer models.

      ViT models were introduced in 2020 by Dosovitskiy et al. in the landmark paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929), a work that has received almost 30,000 academic citations since its publication.

      So claiming that transformers only improve natural language and visual output is straight-up wrong. They are also widely used in visual analysis, including classification and detection.
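
To make the "tokenizing the input image into smaller chunks" step concrete, here is a minimal sketch in plain Python. It only shows how an image can be cut into non-overlapping patches and flattened into a sequence of "tokens". The function name `patchify` and the 4x4 toy image are assumptions made for illustration, and the learned linear projection, position embeddings, and attention layers that a real ViT applies afterwards are left out.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of equal-length rows) into flattened,
    non-overlapping square patches, scanned left to right, top to bottom."""
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale "image" with pixel values 0..15 (illustrative only).
image = [[row * 4 + col for col in range(4)] for row in range(4)]

# Each 2x2 patch becomes one token; the sequence of tokens is what a
# transformer would attend over, the way it attends over word tokens.
tokens = patchify(image, patch_size=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

The paper's title refers to the same idea at a larger scale: 16x16 pixel patches play the role of words.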

      • Barbarian@sh.itjust.works
        4 months ago

        Thank you for the correction. So hypothetically, with millions of hours of GoPro footage from the scuttle crew, and if we had some futuristic supercomputer that could crunch live data from a standard-definition camera and output decisions, we could hook that up to a Boston Dynamics-style robot and have it replace one member of the crew?

    • GBU_28@lemm.ee
      4 months ago

      Huh? Image AI to semantic formatting, then consumption, is trivial now.

      • Barbarian@sh.itjust.works
        4 months ago

        Could you give me an example that uses live feeds of video data, or feeds the output to another system? As far as I’m aware (I could be very wrong! Not an expert), the only things that come close to that are things like OCR systems and character recognition. Describing in machine-readable actionable terms what’s happening in an image isn’t a thing, as far as I know.

        • GBU_28@lemm.ee
          4 months ago

          No live video, no; that didn’t seem to be the topic.

          But if you had the horsepower, I don’t think it’s impossible based on what I’ve worked with. It’s just about snipping and distributing the images, from a bottleneck standpoint

          • Barbarian@sh.itjust.works
            4 months ago

            “No live videos”

            Well, that’d be a prerequisite to a transformer model making decisions for a ship scuttling robot, hence why I brought it up.

        • FooBarrington@lemmy.world
          4 months ago

          “Describing in machine-readable actionable terms what’s happening in an image isn’t a thing, as far as I know.”

          It is. That’s actually the basis of multimodal transformers - they have a shared embedding space for multiple modes of data (e.g. text and images). If you encode data and take those embeddings, you suddenly have a vector describing the contents of your input.
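
As a rough illustration of why a shared embedding space gives another system something machine-readable to act on, here is a small sketch in plain Python. The three-dimensional vectors and the captions are hand-written placeholders, not the output of any real model; an actual multimodal encoder (a CLIP-style model, for example) would produce much higher-dimensional embeddings. The helper names `cosine_similarity` and `best_label` are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_label(image_vec, labelled_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(
        labelled_vecs,
        key=lambda label: cosine_similarity(image_vec, labelled_vecs[label]),
    )


# Hypothetical embeddings: in a real system these would come from the image
# and text encoders of a multimodal model that share one embedding space.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a person on a ship deck": [0.8, 0.2, 0.1],
    "an empty corridor": [0.1, 0.9, 0.2],
    "open water": [0.0, 0.2, 0.9],
}

# The chosen caption is a plain string that a downstream system can branch on.
label = best_label(image_embedding, text_embeddings)
print(label)  # a person on a ship deck
```

Because the image and the candidate descriptions live in the same vector space, comparing them reduces to ordinary vector arithmetic, and the result can be passed on as structured data rather than free text.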