One of the most frustrating parts of building AI-powered document analysis is chunking. You split your document into pieces, embed each chunk, build a RAG pipeline, and pray that the relevant information ends up in the same retrieval window. It's complex, error-prone, and adds latency.
What if you could just... send the entire document to the AI in one API call?
With 128K-token context windows now available from Moonshot (Kimi) and Qwen, that's exactly what you can do. 128K tokens is roughly 96,000 words — enough for a 300-page book, a complete legal contract, or an entire codebase.
RAG (Retrieval-Augmented Generation) is great when you have millions of documents and need to search across them. But for single-document analysis, long context is simply better:
| Scenario | RAG | Long Context (128K) |
|---|---|---|
| Summarize a 50-page report | May miss cross-section connections | Sees the entire document; summary can draw on every section |
| Find contradictions in a contract | Can miss if clauses are in different chunks | Compares all clauses simultaneously |
| Code review of a repository | Loses inter-file dependencies | Understands full project structure |
| Q&A across a textbook | Good for specific lookups | Better for conceptual questions |
```python
from openai import OpenAI

client = OpenAI(
    api_key="nvai-your-api-key",
    base_url="https://aiapi-pro.com/v1"
)

# Read your entire document
with open("contract.txt") as f:
    document = f.read()  # Can be up to ~96,000 words

response = client.chat.completions.create(
    model="moonshot-v1-128k",
    messages=[
        {"role": "system", "content": "You are a legal document analyst. Analyze the following contract and identify all obligations, deadlines, and potential risks."},
        {"role": "user", "content": document}
    ]
)

print(response.choices[0].message.content)
```
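Before sending a document, it's worth checking that it actually fits in the window. A minimal sketch of a pre-flight check, using the rough rule of thumb that English text averages about 4 characters per token (the helper names and the 4,000-token reserve for the prompt and reply are illustrative choices, not part of any API; for exact counts you'd use the provider's own tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, limit: int = 128_000, reserve: int = 4_000) -> bool:
    """True if the text, plus a reserve for the system prompt and the
    model's reply, fits within the context window."""
    return estimate_tokens(text) + reserve <= limit

# A ~600,000-character contract (~150K tokens) would need chunking after all:
# fits_in_context("a" * 600_000) -> False
```

If the check fails, you can fall back to splitting the document, but for anything under roughly 96,000 words a single call suffices.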
Processing a 50,000-token document (roughly 37,000 words) with different models:
| Model | Input Cost | Output Cost (2K tokens) | Total |
|---|---|---|---|
| Moonshot-128K (NovAI) | $0.03 | $0.006 | $0.036 |
| Qwen-Max (NovAI) | $0.02 | $0.003 | $0.023 |
| GPT-4o (128K) | $0.125 | $0.02 | $0.145 |
| Claude 3.5 (200K) | $0.15 | $0.02 | $0.17 |
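The totals above are just per-million-token prices scaled by usage. A quick sanity check (the per-million prices below are back-calculated from the table rows, so treat them as illustrative rather than official pricing):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# GPT-4o row: $2.50/M input, $10/M output (implied by the table)
print(request_cost(50_000, 2_000, 2.50, 10.00))  # 0.145

# Moonshot-128K row: $0.60/M input, $3/M output (implied by the table)
print(request_cost(50_000, 2_000, 0.60, 3.00))   # 0.036
```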
NovAI's models are 4-7x cheaper than GPT-4o and Claude for long document analysis, with comparable quality for most tasks.
- **Choose the right model:** Use Moonshot-128K when you need the full context window and excellent recall. Use Qwen-Max when Chinese text is involved. Use DeepSeek-v3.2 for code analysis.
- **Structure your prompts:** Even with long context, a clear system prompt helps the model focus. Tell it exactly what to look for in the document.
- **Stream the response:** Long context inputs may take a few seconds to process. Use streaming to show results as they generate, improving perceived latency.
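The streaming tip can be sketched with the same SDK by passing `stream=True` to `chat.completions.create` and printing each delta as it arrives (the helper function names here are illustrative; the model name and base URL are the ones used in the earlier example):

```python
def stream_print(stream) -> str:
    """Print content deltas as they arrive and return the full text."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

def analyze_streaming(document: str) -> str:
    from openai import OpenAI  # third-party client, as in the earlier example
    client = OpenAI(api_key="nvai-your-api-key", base_url="https://aiapi-pro.com/v1")
    stream = client.chat.completions.create(
        model="moonshot-v1-128k",
        messages=[{"role": "user", "content": document}],
        stream=True,  # tokens arrive incrementally instead of one final payload
    )
    return stream_print(stream)
```

With streaming, the first tokens appear as soon as the model starts generating, so a long prompt's processing delay is only felt once up front.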
Send entire documents to Moonshot-128K and Qwen-Max. $0.50 free credits on signup.
Get Started Free →