You Need to Build Your Own Personal Corpus Before the Singularity

Don’t let big companies own your data and metadata. Take control and create your own personalized LLM from your own corpus of information.

Andrew Crider


Image available to author via unique access to Midjourney and DALL-E; the author assumes responsibility for the authenticity.

The advancements in AI have been dramatic over the past few months. Midjourney and Stable Diffusion have increased in complexity, NeRF can create three-dimensional experiences from just a series of pictures, and ChatGPT has transformed from a cool internet thing into a business that Microsoft is doubling down on.

I’ve been ramping up myself over the past few months, making everything from pictures that augment writing prompts to aprons for your summer cookout and ambient music videos on YouTube. But with the advancement of large language models, the technology behind ChatGPT, I think it’s time to incorporate more AI into my life.

Embeddings … it’s about the data

Large Language Models (LLMs) are trained on data. Massive amounts of data. ChatGPT was reportedly trained on about 45 terabytes of text. Here’s what it has to say about itself:

Generated through ChatGPT

An LLM only knows what it was fed as training data. Ask it who currently represents your district and it can’t know, yet it will answer confidently, even if the answer dates from 2021. When the model is trained, it learns embeddings, numeric vectors that capture relationships between words and, in theory, concepts. I could explain it to you, but:

ChatGPT explaining embeddings
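
To make the idea of embeddings relating words a little more concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented for illustration (real models learn vectors with hundreds or thousands of dimensions), and the names toy_embeddings, cosine_similarity, and nearest are hypothetical helpers, not part of any library. It only shows how "closeness" between vectors can be measured.

```python
import math

# Toy 3-dimensional "embeddings". These numbers are made up for illustration;
# a real model learns its vectors from training data.
toy_embeddings = {
    "car": [0.9, 0.8, 0.1],
    "Ford": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings):
    """Return the other word whose vector is most similar to `word`."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest("car", toy_embeddings))  # prints "Ford"
```

With these made-up numbers, "car" lands next to "Ford" and far from "banana". Nothing in the calculation knows what a car is; it only compares directions of vectors.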

A couple of things jump out here. First, LLMs are not always accurate. The computer is guessing … it doesn’t know what the next word or sentence means; it just knows that the vectors of the embedded tokens are close. So “cars” and “Ford” sit close to each other, which is why they show up together in the model’s response. Second, the training data is the most critical part of the model. You could create a model…