LLM Use Case: Mining 10Ks

In this post, I explore building a structured dataset from 10-K filings using LLMs.

Ideally, the LLM would consume an entire 10-K and then answer any question about it. However, LLMs have a limited context length. For example, GPT-3's text-davinci-003 has a 4,000-token limit, while a 10-K filing from a publicly traded company runs to roughly 0.5 million tokens. Thus, a simple prompt with the filing as the preceding text followed by the query does not work.

There are a few ways to make the task feasible, all of which involve chunking: splitting the text into sub-documents. This limitation can even be a strength, because processing many chunks enables ensembling, which makes the results more robust, although it requires additional operational effort.
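As a minimal sketch of the chunking step, the function below splits a long text into sub-documents of roughly equal size. It uses word counts as a rough stand-in for tokens; a real pipeline would count tokens with the model's actual tokenizer (this is an assumption for illustration, not the exact splitter used here).

```python
def chunk_text(text, max_tokens=3000):
    """Split text into sub-documents of roughly max_tokens words.

    Words approximate tokens here; a production pipeline would use
    the model's tokenizer to count tokens exactly.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

chunks = chunk_text("word " * 7000, max_tokens=3000)
# 7000 words at 3000 words per chunk -> 3 sub-documents
```

Each resulting chunk fits comfortably under the model's context limit, leaving room for the query and the answer.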

The query is applied over multiple sub-documents, yielding multiple answers per query. These are post-processed (using ensembling and hand-written rules) to arrive at a single, confident answer.
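A hedged sketch of that post-processing step: discard the "I don't know" answers, then take a majority vote over the informative ones. The function name and the unknown-marker handling are illustrative choices, not the exact rules used in this post.

```python
from collections import Counter

def ensemble_answers(answers, unknown="I don't know."):
    """Reduce many per-chunk answers to one confident answer.

    Filters out the unknown marker, then majority-votes. Returns the
    winning answer and its vote share among informative answers.
    """
    informative = [a for a in answers if a.strip() != unknown]
    if not informative:
        return unknown, 0.0
    answer, votes = Counter(informative).most_common(1)[0]
    return answer, votes / len(informative)

# Vote counts taken from the company-name results shown later in the post
answers = ["Alphabet Inc."] * 77 + ["I don't know."] * 35 + ["Google"] * 4
best, share = ensemble_answers(answers)
# best == "Alphabet Inc."
```

The vote share doubles as a crude confidence score: an answer backed by 77 of 81 informative votes is far more trustworthy than one backed by 2 of 6.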

In the example below, I will operate on the Google 10K filing from 2022. I will attempt to retrieve the company name, the ticker, and the annual revenue. For the company name, the query looks as follows, with similar queries for the other two:

query = ''' What is the name of the company being discussed? If you don't know, say "I don't know". Otherwise, state the full company name. '''

Approach 1: Brute Force + $! = Unscalable

After loading the PDF using langchain.document_loaders.UnstructuredPDFLoader, the document is split into 122 sub-documents of approximately 3,000 tokens each. Each sub-document is then prompted with each query, yielding 122 answers per query. The results, shown below, are impressive. However, processing these 3 queries x 122 sub-documents of 3,000 tokens each cost about $5, which is prohibitively expensive at scale. It was also slow, so achieving operational scale would require parallelization.
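The brute-force loop itself is simple. Below is a sketch with a pluggable `llm` callable standing in for the actual text-davinci-003 call (the prompt template and function names are assumptions for illustration):

```python
def query_all_chunks(chunks, query, llm):
    """Run one query against every sub-document independently.

    `llm` is any callable mapping prompt -> answer; in the real run
    this was an OpenAI completion call, invoked once per chunk.
    """
    prompt_template = "{chunk}\n\nQuestion: {query}\nAnswer:"
    return [llm(prompt_template.format(chunk=c, query=query)) for c in chunks]

# Toy stand-in model, for illustration only
def fake_llm(prompt):
    return "I don't know."

answers = query_all_chunks(["chunk one", "chunk two"], "What is the ticker?", fake_llm)
# -> ["I don't know.", "I don't know."]
```

With 122 chunks per query, this is 122 independent API calls per question, which is exactly why the cost and latency add up.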

Results:

  • Company Name: Alphabet Inc. (77), I don't know. (35), Google (4), Alphabet (4), Fitbit (1), Oracle America, Inc. (1)
  • Ticker: I don't know. (117), GOOGL (2), GOOGL and GOOG (1), GOOG, GOOGL (1), Alphabet Inc. (GOOGL) (1)
  • Annual Revenue: I don't know. (116), $257,637 million (2), 257,637 (2), The latest annual revenue of the company is $257.6 billion. (1), 257,637 million (1)

Approach 2: Embedding

In this approach, each of the 122 sub-documents is embedded, which is a one-time cost. The query is then embedded and used to retrieve the top N most relevant sub-documents, which alone are passed to the LLM. This makes the overall approach cheaper and more scalable, though not necessarily more robust, because we lose the ensembling component. In the sample below, it worked well, and the cost was effectively nil.
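A minimal sketch of the retrieval step, using cosine similarity over plain Python lists. A real implementation would use an embedding model and a vector store; the 3-dimensional toy vectors below are placeholders for actual embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_n(query_vec, doc_vecs, n=2):
    """Return indices of the n sub-documents most similar to the query."""
    scores = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:n]]

# Toy 3-d "embeddings" standing in for real model output
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top_n([1.0, 0.0, 0.0], docs, n=2)  # -> [0, 2]
```

Only the retrieved sub-documents are then sent to the LLM with the query, so the per-question token cost drops from 122 chunks to N.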

Results:

  • Company Name: Alphabet Inc.
  • Ticker: GOOGL and GOOG
  • Annual Revenue: The latest annual revenue of the company is $257.6 billion.

Next Steps

As next steps, I will look at 1) leveraging open-source models, 2) post-processing results to create structured tables, and 3) scaling this to process multiple documents.