You Need to Build Your Own Personal Corpus Before the Singularity

Don’t let big companies own your data and metadata. Take control and create your own personalized LLM from your own corpus of information.

Andrew Crider
4 min read · Feb 17, 2023
Image created by the author with Midjourney and DALL-E.

The advances in AI have been dramatic over the past few months. Midjourney and Stable Diffusion have grown more capable, NeRF can create three-dimensional scenes from just a series of pictures, and ChatGPT has transformed from a cool internet curiosity into a business that Microsoft is doubling down on.

I’ve been ramping up my own use of these tools over the past few months, making everything from images to augment writing prompts, to aprons for your summer cookout, to ambient music videos on YouTube. With the advancement of Large Language Models, the technology behind ChatGPT, I think it’s time to incorporate more AI into my life.

Embeddings … it’s about the data

Large Language Models (LLMs) are trained on data. Massive amounts of data: ChatGPT’s underlying model was trained on roughly 45 terabytes of text. Here’s what it has to say about itself: