forked from dottxt-ai/outlines
Update CFGGuide to use outlines.fsm.parsing. Enable generate.cfg
Showing 23 changed files with 1,520 additions and 656 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
import random | ||
|
||
from transformers import AutoTokenizer | ||
|
||
import outlines.grammars | ||
from outlines.caching import cache_disabled | ||
from outlines.fsm.guide import CFGGuide | ||
from outlines.models.transformers import TransformerTokenizer | ||
|
||
from .common import ensure_numba_compiled | ||
|
||
random.seed(42) | ||
|
||
|
||
def get_tiny_tokenizer(): | ||
    """1000 tokens in vocabulary""" | ||
    return TransformerTokenizer( | ||
        AutoTokenizer.from_pretrained("hf-internal-testing/tiny-random-gpt2") | ||
    ) | ||
|
||
|
||
benched_grammars = { | ||
    "json": outlines.grammars.json, | ||
    "arithmetic": outlines.grammars.arithmetic, | ||
} | ||
|
||
|
||
class CFGGuideBenchmark: | ||
    params = benched_grammars.keys() | ||
|
||
    def setup(self, grammar_name): | ||
        self.tokenizer = get_tiny_tokenizer() | ||
        ensure_numba_compiled( | ||
            self.tokenizer | ||
        )  # numba not currently used, but will be in the future | ||
        self.prebuilt_cfg_guide = CFGGuide( | ||
            benched_grammars[grammar_name], self.tokenizer | ||
        ) | ||
|
||
    @staticmethod | ||
    def _run_random_cfg(guide): | ||
        state = guide.initial_state | ||
        token_ids = list(guide.tokenizer.vocabulary.values()) | ||
        for _ in range(40): | ||
            # simulate logits ordered from highest to lowest probability | ||
            random.shuffle(token_ids) | ||
            # simulate sampling and state update | ||
            next_token_id = next(guide.iter_valid_token_ids(state, token_ids)) | ||
            state = guide.get_next_state(state, next_token_id) | ||
|
||
    @cache_disabled() | ||
    def time_cfg_guide_setup(self, grammar_name): | ||
        CFGGuide(benched_grammars[grammar_name], self.tokenizer) | ||
|
||
    @cache_disabled() | ||
    def time_cfg_guide_run(self, grammar): | ||
        self._run_random_cfg(self.prebuilt_cfg_guide) | ||
|
||
    @cache_disabled() | ||
    def peakmem_cfg_guide_run(self, grammar): | ||
        self._run_random_cfg(self.prebuilt_cfg_guide) |
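The random walk in `_run_random_cfg` can be tried without a tokenizer or a grammar. The sketch below is only an illustration: `ToyGuide` and `run_random_walk` are hypothetical names, the toy guide keeps its vocabulary directly on the guide rather than on a tokenizer, and it mimics only the members the benchmark loop calls (`initial_state`, `iter_valid_token_ids`, `get_next_state`). It is not how `CFGGuide` is implemented.

```python
import random


class ToyGuide:
    """Hypothetical stand-in exposing only the members the benchmark loop uses."""

    initial_state = 0

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary  # token string -> token id

    def iter_valid_token_ids(self, state, candidate_token_ids):
        # Toy rule: the state is the number of currently open parentheses,
        # so ")" is only valid when at least one "(" is open.
        closing_id = self.vocabulary[")"]
        for token_id in candidate_token_ids:
            if token_id != closing_id or state > 0:
                yield token_id

    def get_next_state(self, state, token_id):
        return state + 1 if token_id == self.vocabulary["("] else state - 1


def run_random_walk(guide, steps=40, seed=42):
    rng = random.Random(seed)
    state = guide.initial_state
    token_ids = list(guide.vocabulary.values())
    chosen = []
    for _ in range(steps):
        rng.shuffle(token_ids)  # simulate logits ordered from most to least likely
        # take the first candidate the guide accepts, as the benchmark does
        next_token_id = next(guide.iter_valid_token_ids(state, token_ids))
        state = guide.get_next_state(state, next_token_id)
        chosen.append(next_token_id)
    return chosen, state


chosen, final_state = run_random_walk(ToyGuide({"(": 0, ")": 1}))
print(len(chosen), final_state)
```

Because the first valid candidate in shuffled order is always taken, each step costs at most one pass over the vocabulary, which is the cost the benchmark is measuring for the real guide.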
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
# Overview | ||
|
||
Outlines allows the use of [Lark](https://github.com/lark-parser/lark) grammars to guide generation. These grammars are used to construct parsers that filter out incompatible tokens during the generation process. The result is a generation that adheres to the grammar's production rules. | ||
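The filtering idea can be sketched in a few lines of plain Python. This is a toy illustration, not Outlines' implementation: `is_valid_prefix` and `allowed_tokens` are hypothetical names, and a hand-written balanced-parentheses check stands in for a parser built from a Lark grammar.

```python
from typing import Callable, Iterable, List


def is_valid_prefix(text: str) -> bool:
    """Toy stand-in for a parser: accept prefixes of balanced-parenthesis strings."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "(" can never be repaired
                return False
        else:
            return False  # only parentheses belong to this toy language
    return True


def allowed_tokens(
    generated: str,
    vocabulary: Iterable[str],
    accepts_prefix: Callable[[str], bool] = is_valid_prefix,
) -> List[str]:
    """Keep only the tokens that leave the text a valid prefix of the language."""
    return [token for token in vocabulary if accepts_prefix(generated + token)]


vocabulary = ["(", ")", "()", "))", "a"]
print(allowed_tokens("", vocabulary))   # ['(', '()']
print(allowed_tokens("(", vocabulary))  # ['(', ')', '()']
```

At each generation step, only tokens that survive this kind of filter remain candidates for sampling, which is why the final output follows the grammar.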
|
||
# Primer on Creating Grammars | ||
|
||
To create grammars for Outlines, a solid understanding of Lark grammars is necessary. Here's how you can get started: | ||
|
||
- Read Lark's grammar documentation [here](https://lark-parser.readthedocs.io/en/latest/grammar.html). | ||
- Review Outlines' existing grammars [here](/outlines/grammars). | ||
|
||
|
||
# Compatibility With Outlines | ||
|
||
Note that not all Lark grammars work with Outlines. Changes may be necessary to ensure compatibility. | ||
|
||
### LALR(1) Parser | ||
|
||
Outlines uses Lark's LALR(1) parser, meaning the grammar must be unambiguous at least up to the next token (one token lookahead). Read Lark's official LALR(1) parser documentation [here](https://lark-parser.readthedocs.io/en/stable/parsers.html#lalr-1). | ||
|
||
If your grammar is ambiguous, you will receive an error like the following at runtime: | ||
|
||
``` | ||
GrammarError: Reduce/Reduce collision in Terminal('B') between the following rules: | ||
``` | ||
|
||
### Regex Terminal Restrictions | ||
|
||
Outlines converts terminals to finite state machines using the [Interegular](https://github.com/MegaIng/interegular/) library. Not all regular expressions work with Interegular; the following subsections describe how to work around its restrictions. | ||
|
||
|
||
#### Avoid Lookarounds | ||
|
||
The following example removes a lookaround while maintaining the same functionality. | ||
|
||
##### Example: Escaped String | ||
|
||
From Outlines' modified `ESCAPED_STRING` in [common.lark](/outlines/grammars/common.lark). | ||
|
||
Before: | ||
``` | ||
_STRING_INNER: /.*?/ | ||
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/ | ||
ESCAPED_STRING : "\"" _STRING_ESC_INNER "\"" | ||
``` | ||
|
||
After: | ||
``` | ||
_NON_CONTROL_CHAR: /([^"\\\x00-\x1F\x7F-\x9F])/ | ||
_ESCAPED_CHAR: /\\/ (_NON_CONTROL_CHAR | /\\/ | /"/) | ||
ESCAPED_STRING_INNER: _NON_CONTROL_CHAR | _ESCAPED_CHAR | ||
ESCAPED_STRING: /"/ ESCAPED_STRING_INNER* /"/ | ||
``` | ||
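The lookaround-free terminals can be sanity-checked outside of Lark. The sketch below transcribes the "After" definitions into a single Python `re` pattern; the names `NON_CONTROL_CHAR`, `ESCAPED_CHAR`, `ESCAPED_STRING`, and `is_escaped_string` are illustrative only, and the transcription assumes the terminals simply concatenate as written.

```python
import re

# Python transcriptions of the lookaround-free terminals above.
NON_CONTROL_CHAR = r'[^"\\\x00-\x1F\x7F-\x9F]'
ESCAPED_CHAR = r'\\(?:' + NON_CONTROL_CHAR + r'|\\|")'
ESCAPED_STRING = re.compile(
    r'"(?:' + NON_CONTROL_CHAR + r'|' + ESCAPED_CHAR + r')*"'
)


def is_escaped_string(text: str) -> bool:
    """Return True if the whole text is one double-quoted, escaped string."""
    return ESCAPED_STRING.fullmatch(text) is not None


print(is_escaped_string(r'"plain text"'))           # True
print(is_escaped_string(r'"a \"quoted\" word"'))    # True
print(is_escaped_string(r'"trailing backslash\"'))  # False: the last quote is escaped
```

Since the pattern uses only character classes, alternation, and repetition, it describes a regular language and can be compiled to a finite state machine.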
|
||
#### Avoid Backreferences | ||
|
||
Backreferences, for example `([ab]*)\1`, cannot be simulated by a finite state machine and will result in an error if used. | ||
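To see why, consider what a pattern such as `([ab]*)\1` matches in Python's backtracking `re` engine: strings made of some word followed by an exact copy of itself. Recognizing these requires remembering an arbitrarily long captured substring, which a fixed number of states cannot do. The helper name `is_doubled` below is illustrative only.

```python
import re

# \1 must repeat exactly what group 1 captured, so this matches "ww" strings.
DOUBLED = re.compile(r"([ab]*)\1")


def is_doubled(text: str) -> bool:
    """Return True if the text is some word over {a, b} written twice."""
    return DOUBLED.fullmatch(text) is not None


print(is_doubled("abab"))  # True: "ab" followed by "ab"
print(is_doubled("abba"))  # False: the second half does not repeat the first
```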
|
||
# Creating a Valid Grammar | ||
|
||
You can use Outlines' test suite to verify your grammar. | ||
|
||
### 1) Create Your Grammar | ||
|
||
Create your grammar file named `your_new_grammar.lark`, adhering to the guidelines provided above. Add it to `outlines/grammars/` (ensure attribution is included and license is compatible). | ||
|
||
Update `outlines/grammars.py` with a line including your grammar. | ||
|
||
### 2) Test Your Grammar | ||
|
||
Test your grammar for false negatives by ensuring that valid samples can be generated: | ||
- Add valid example outputs which are compliant with the grammar to `tests/benchmark/cfg_samples/your_new_grammar/` | ||
- Run the tests for your grammar via `pytest -s tests/fsm/test_cfg_guide.py::test_cfg_grammar_sample -k "your_new_grammar"` | ||
|
||
Test your grammar for false positives by ensuring that invalid outputs aren't generated. | ||
|
||
Currently there isn't a built-in utility for false-positive testing. It is recommended that you smoke test via: | ||
``` | ||
from outlines import models, generate, grammars | ||
model = models.transformers("mistralai/Mistral-7B-v0.1") | ||
generator = generate.cfg(model, grammars.your_new_grammar) | ||
result = generator("<your prompt to generate output for your grammar>") | ||
print(result) | ||
``` | ||
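Until such a utility exists, a small helper can make false-positive checks repeatable. The sketch below is an assumption-laden illustration, not part of Outlines: `find_false_positives` and `accepts_json` are hypothetical names, and the standard library's `json.loads` stands in for an acceptor built from your own grammar.

```python
import json
from typing import Callable, Iterable, List


def accepts_json(text: str) -> bool:
    """Stand-in acceptor: True if the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def find_false_positives(
    accepts: Callable[[str], bool], invalid_samples: Iterable[str]
) -> List[str]:
    """Return the known-invalid samples that the acceptor wrongly accepts."""
    return [sample for sample in invalid_samples if accepts(sample)]


# Hand-written samples that should all be rejected.
invalid_samples = ['{"a": 1,}', "[1, 2", "{'a': 1}"]
print(find_false_positives(accepts_json, invalid_samples))  # []
```

In practice, `accepts` would wrap whatever checks a string against your grammar, and a non-empty result would point at samples the grammar is too permissive about.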
|
||
# Converting | ||
Several tools are available for converting other grammar formats to Lark. They serve as a starting point; you will typically need to make additional adjustments to ensure full compatibility and proper functioning within Outlines. | ||
|
||
Tools: | ||
- Lark's built-in "Nearley-to-Lark" converter https://lark-parser.readthedocs.io/en/latest/tools.html | ||
- Convert ANTLR4 to Lark (note: most ANTLR4 grammars are not LALR(1)-compatible, so they will require additional tweaking) https://github.com/kaby76/Domemtech.Trash/blob/main/src/trconvert/readme.md | ||
- Extract EBNF from Yacc files https://www.bottlecaps.de/rr/ui | ||
|
||
Reference Grammars: | ||
- GitHub Lark grammars https://github.com/search?q=path%3A*.lark&type=code | ||
- GitHub Nearley grammars https://github.com/search?q=path%3A*.ne+%22-%3E%22&type=code | ||
- ANTLR4 grammars https://github.com/antlr/grammars-v4/ | ||
- Grammar zoo https://slebok.github.io/zoo/index.html#html |