In this three-part series, I explore how to extract information from unstructured data, in particular 10-Ks. In this first part, I explain how to pre-process documents and create an index of embeddings. This sets up the retrieval of the subdocuments that are most relevant to the target extraction, which is covered in Part 2. The steps and outline for this post are:
As shown in a previous post, extraction can be done quite simply with foundational models. However, this can be expensive, which motivates introducing some complexity. Even with an open source model, loading and running a foundational model has its own challenges; I will revisit those in a future post.
The code is here. In particular, you can follow along using the notebook preprocessing.ipynb.
Chunking the Document
For chunking, I take advantage of the langchain libraries:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
The parameters to consider are chunk size and chunk overlap. I chose 1250 and 100, respectively, as a starting point; these can and should be calibrated against final performance metrics. Offline, I saw that a chunk size of 1250 corresponds to approximately 100 tokens. Another important design choice is the splitting strategy. From the documentation:
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
Finally, I ignored the choice of document and the handling of tables, but both are clearly fundamental factors in performance.
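To make the splitting strategy concrete, here is a simplified, pure-Python sketch of the recursive idea described in the quote above. This is an illustration only, not langchain's actual implementation (it drops separators and does not handle overlap); in the notebook, the real `RecursiveCharacterTextSplitter` is used with chunk size 1250 and overlap 100.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive character splitting:
    try separators in order, split on the first one, and recurse
    into any piece that is still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text]
    sep = separators[0]
    rest = separators[1:]
    if sep == "":
        # Last resort: hard split by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

The effect is that paragraphs stay whole when they fit, then sentences, then words, matching the "strongest semantically related pieces" behavior the documentation describes.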
Embedding the Subdocuments
I used the HuggingFace library to run two small but effective models: 'intfloat/e5-small-v2' and 'thenlper/gte-small'. For the former, the text needs to be prepended with "passage: " (for documents) or "query: " (for queries) before applying the model. Both models produce an embedding vector of dimension 384 per token.
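As a minimal sketch of the e5 prefix convention (the prefix strings follow the e5 model card; the helper name is my own):

```python
def format_for_e5(text, is_query=False):
    """Prepend the e5-required prefix: 'query: ' for search queries,
    'passage: ' for documents to be indexed."""
    prefix = "query: " if is_query else "passage: "
    return prefix + text

# format_for_e5("net revenue was flat")        -> "passage: net revenue was flat"
# format_for_e5("what was net revenue?", True) -> "query: what was net revenue?"
```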
A design choice is how to collapse the per-token vectors into a single representative vector. The conventional approach is mean pooling: averaging over all tokens. Note that this is tied to the chunking step. In practice, the average should be taken only over real (non-padding) tokens, which you can do with the attention mask.
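A pure-Python sketch of masked mean pooling (in the notebook this is done on the model's tensor outputs; here I use plain lists to show just the averaging logic):

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token embeddings, counting only positions where the
    attention mask is 1 (i.e., real tokens, not padding)."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, mask in zip(token_vectors, attention_mask):
        if mask == 1:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    return [s / count for s in summed]

# Two real tokens and one padding token (mask 0); the padded
# vector [9.0, 9.0] is excluded from the average:
vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
# masked_mean_pool(vectors, mask) -> [2.0, 3.0]
```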
Indexing the Embeddings
In a previous post, I used Pinecone and found it very simple to use. For this project, I wanted to try something on my local machine, so I learned about OpenSearch. Using it also requires some familiarity with Docker: I spun up a container from an opensearch image using the command in the docker.sh file in my repo. I then used the Python client to create an index and add the documents along with their embeddings. This was not something I knew out of the box, so it took some time to understand the nuances, in particular the formatting of the search query and the scoring for retrieval, which I discuss in the next post.
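For reference, the shape of a k-NN index mapping in OpenSearch looks roughly like this (a sketch based on the OpenSearch k-NN plugin's conventions; the field names are hypothetical, not taken from the repo):

```json
{
  "settings": { "index": { "knn": true } },
  "mappings": {
    "properties": {
      "text":      { "type": "text" },
      "embedding": { "type": "knn_vector", "dimension": 384 }
    }
  }
}
```

The `dimension` must match the embedding model's output, 384 here for both e5-small-v2 and gte-small.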
Finally, I scaled this to 5 PDFs, and presumably it can be scaled further. Going from 1 to 5 took some code refactoring and memory management. Garbage collection on the transformer model did not work as I expected when functions go out of scope (why?), so running a function 5 times ran into memory capacity limits. Instead, I refactored the code to load the model into memory once and re-use it.
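The load-once pattern can be sketched as follows (the model is stubbed with a placeholder class for illustration; in the notebook it would be the HuggingFace model):

```python
class StubModel:
    """Placeholder standing in for the real transformer model."""
    def encode(self, text):
        return [0.0] * 384

_model = None

def get_model():
    """Load the model once and reuse it, instead of loading inside a
    per-document function where cleanup on scope exit is unreliable."""
    global _model
    if _model is None:
        _model = StubModel()  # in practice: AutoModel.from_pretrained(...)
    return _model

def embed_documents(texts):
    model = get_model()  # same instance on every call; no per-document load
    return [model.encode(t) for t in texts]
```

With this structure, processing 5 documents touches a single model instance instead of allocating 5.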