**ABearden** · (This post was last modified: 03-12-2022, 12:10 PM by ABearden.)

(03-10-2022, 05:51 PM)Claud Wrote: I'm working on a project that would download MP3s from an RSS feed and transcribe the audio from those MP3s into a large JSON file. Alternatively, it can take a YouTube channel URL, download all the VTT files that YouTube already has, and convert them to text files.

I still need to figure out how to convert those text files into JSON as well.

Converting them to JSON is important for speedy indexing when the data gets uploaded to a PostgreSQL database.

I think I'm right about the steps involved, but I'd love someone with more coding experience to give me some pointers.

The GitHub repo for the project is here: https://github.com/claudchereji/PodSearch

It's still very much a work in progress.

I'm curious how JSON is connected to database indexing. I would think raw data would be more efficient, but I don't work much with PostgreSQL. Either way, VTT files are already text files.

Also, the conversion to JSON depends entirely on how you will be using it. For example, a VTT file could be converted to something like this, with extra metadata:

Code:
{
    "title": "Introduction",
    "url": "http://youtu.be/gwrg832j2",
    "captions": [
        {
            "timestamp": "00:00:15.000 --> 00:00:18.200",
            "align": "middle",
            "text": "Hi Makerspace, I'm Claud."
        },
        {
            "timestamp": "00:00:19.300 --> 00:00:21.600",
            "align": "middle",
            "text": "Today we're building a VTT to JSON converter."
        }
    ]
}
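If it helps, here is a rough Python sketch of how a VTT file could be turned into that structure. It is only one way to do it: the function name `vtt_to_dict` and the `SAMPLE` text are made up for illustration, and the parser only handles plain cues (it skips the header, cue identifiers, and NOTE blocks, and does not strip inline tags such as `<c>`).

```python
import json
import re

# Matches a WebVTT cue timing line, e.g.
# "00:00:15.000 --> 00:00:18.200 align:middle".
# The hours field is optional in WebVTT, so "00:15.000" is accepted too.
CUE_TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+"
    r"((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})(.*)$"
)


def vtt_to_dict(vtt_text, title, url):
    """Convert WebVTT text into a dict shaped like the JSON example above."""
    captions = []
    current = None  # the cue being filled in, or None between cues
    for line in vtt_text.splitlines():
        line = line.strip()
        match = CUE_TIMING.match(line)
        if match:
            start, end, settings = match.groups()
            current = {"timestamp": f"{start} --> {end}"}
            # Cue settings are space-separated "name:value" pairs.
            for setting in settings.split():
                name, _, value = setting.partition(":")
                if name == "align":
                    current["align"] = value
            current["text"] = ""
            captions.append(current)
        elif not line:
            # A blank line ends the current cue.
            current = None
        elif current is not None:
            # Join multi-line cue text with a single space.
            current["text"] = (current["text"] + " " + line).strip()
        # Anything else (header, cue identifiers, NOTE blocks) is ignored.
    return {"title": title, "url": url, "captions": captions}


# Hypothetical sample input, mirroring the JSON example above.
SAMPLE = """WEBVTT

00:00:15.000 --> 00:00:18.200 align:middle
Hi Makerspace, I'm Claud.

00:00:19.300 --> 00:00:21.600 align:middle
Today we're building a VTT to JSON converter.
"""

if __name__ == "__main__":
    result = vtt_to_dict(SAMPLE, "Introduction", "http://youtu.be/gwrg832j2")
    print(json.dumps(result, indent=4))
```

Keeping the result as a plain dict until the last step means the same data can be dumped to JSON or inserted row by row into a database, whichever turns out to be more useful.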

For a search engine, how you store the data will be heavily influenced by the search algorithm you're using. My experience with writing search engines is a bit dated, but my first instinct would be to put the individual captions in their own table and precompute a score for each word in each caption, stored in a well-indexed table.
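To make that idea a bit more concrete, here is a minimal sketch under a few assumptions of mine: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so it runs without a server, the table and column names (`captions`, `word_scores`) are invented, and the "score" is simply the word's share of the words in its caption. A real search engine would likely use a better weighting.

```python
import re
import sqlite3
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")


def build_index(captions):
    """Store captions in their own table and precompute per-word scores."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE captions (
            id INTEGER PRIMARY KEY,
            timestamp TEXT NOT NULL,
            text TEXT NOT NULL
        );
        CREATE TABLE word_scores (
            word TEXT NOT NULL,
            caption_id INTEGER NOT NULL REFERENCES captions(id),
            score REAL NOT NULL
        );
        -- Lookups are by word, so that is the column to index.
        CREATE INDEX word_scores_word ON word_scores (word);
        """
    )
    for caption in captions:
        cursor = db.execute(
            "INSERT INTO captions (timestamp, text) VALUES (?, ?)",
            (caption["timestamp"], caption["text"]),
        )
        caption_id = cursor.lastrowid
        words = WORD.findall(caption["text"].lower())
        # Assumed score: the word's share of all words in this caption.
        for word, count in Counter(words).items():
            db.execute(
                "INSERT INTO word_scores (word, caption_id, score) VALUES (?, ?, ?)",
                (word, caption_id, count / len(words)),
            )
    db.commit()
    return db


def search(db, query):
    """Return (timestamp, text, total score) rows, best match first."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    return db.execute(
        f"""
        SELECT c.timestamp, c.text, SUM(w.score) AS total
        FROM word_scores AS w
        JOIN captions AS c ON c.id = w.caption_id
        WHERE w.word IN ({placeholders})
        GROUP BY c.id
        ORDER BY total DESC
        """,
        words,
    ).fetchall()


# Hypothetical sample data, taken from the JSON example earlier in this post.
CAPTIONS = [
    {"timestamp": "00:00:15.000 --> 00:00:18.200",
     "text": "Hi Makerspace, I'm Claud."},
    {"timestamp": "00:00:19.300 --> 00:00:21.600",
     "text": "Today we're building a VTT to JSON converter."},
]

if __name__ == "__main__":
    db = build_index(CAPTIONS)
    for row in search(db, "json converter"):
        print(row)
```

The scoring work happens once at import time, so a query is just an indexed lookup plus a sum. PostgreSQL also ships its own full-text search (`tsvector` columns with a GIN index), which may save you from maintaining a table like `word_scores` by hand.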