Commit 1b9924b

Merge pull request #813 from snipsco/release/0.19.7

Release/0.19.7

ClemDoum authored Jun 20, 2019
2 parents b39538e + 6008bfb commit 1b9924b
Showing 42 changed files with 804 additions and 252 deletions.
2 changes: 0 additions & 2 deletions .travis.yml
@@ -24,5 +24,3 @@ script: tox
after_success:
  - tox -e coverage-report
  - codecov
-
-cache: pip
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,21 @@
# Changelog
All notable changes to this project will be documented in this file.

## [0.19.7]
### Changed
- Re-score ambiguous `DeterministicIntentParser` results based on slots [#791](https://github.com/snipsco/snips-nlu/pull/791)
- Accept ambiguous results from `DeterministicIntentParser` when confidence score is above 0.5 [#797](https://github.com/snipsco/snips-nlu/pull/797)
- Avoid generating number variations when not needed [#799](https://github.com/snipsco/snips-nlu/pull/799)
- Moved the NLU random state from the config to the shared resources [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Reduce the custom entity parser footprint at training time [#804](https://github.com/snipsco/snips-nlu/pull/804)
- Bumped `scikit-learn` to `>=0.21,<0.22` for `python>=3.5` and `>=0.20,<0.21` for `python<3.5` [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Update dependencies [#811](https://github.com/snipsco/snips-nlu/pull/811)

### Fixed
- Fixed a couple of bugs in the data augmentation which were making the NLU training non-deterministic [#801](https://github.com/snipsco/snips-nlu/pull/801)
- Remove deprecated code in dataset generation [#803](https://github.com/snipsco/snips-nlu/pull/803)
- Fix possible override of entity values when generating variations [#808](https://github.com/snipsco/snips-nlu/pull/808)

## [0.19.6]
### Fixed
- Raise an error when using unknown intents in intents filter [#788](https://github.com/snipsco/snips-nlu/pull/788)
@@ -269,6 +284,7 @@ several commands.
- Fix compiling issue with `bindgen` dependency when installing from source
- Fix issue in `CRFSlotFiller` when handling builtin entities

[0.19.7]: https://github.com/snipsco/snips-nlu/compare/0.19.6...0.19.7
[0.19.6]: https://github.com/snipsco/snips-nlu/compare/0.19.5...0.19.6
[0.19.5]: https://github.com/snipsco/snips-nlu/compare/0.19.4...0.19.5
[0.19.4]: https://github.com/snipsco/snips-nlu/compare/0.19.3...0.19.4
20 changes: 20 additions & 0 deletions docs/source/tutorial.rst
@@ -174,6 +174,26 @@ the dataset we generated earlier:
    engine.fit(dataset)

Note that, by default, training of the NLU engine is non-deterministic:
training and testing multiple times on the same data may produce different
outputs.

Reproducible training runs can be achieved by passing a **random seed** to
the engine:

.. code-block:: python

    seed = 42
    engine = SnipsNLUEngine(config=CONFIG_EN, random_state=seed)
    engine.fit(dataset)
.. note::

    Due to a ``scikit-learn`` bug fixed in version ``0.21``, we can't
    guarantee any deterministic behavior if you're using a Python version
    ``<3.5``, since ``scikit-learn>=0.21`` is only available starting from
    Python ``>=3.5``.
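
As a quick check, here is a minimal sketch (assuming the ``dataset`` dict and
the imports from the previous steps; the utterance is illustrative): two
engines fitted with the same seed should yield identical parse results:

.. code-block:: python

    engine_a = SnipsNLUEngine(config=CONFIG_EN, random_state=42)
    engine_a.fit(dataset)
    engine_b = SnipsNLUEngine(config=CONFIG_EN, random_state=42)
    engine_b.fit(dataset)
    # Both trainings used the same seed, so the outputs should be identical
    result_a = engine_a.parse("Turn on the lights")
    result_b = engine_b.parse("Turn on the lights")
    assert result_a == result_b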


Parsing
-------
25 changes: 13 additions & 12 deletions setup.py
@@ -17,22 +17,23 @@
    readme = f.read()

required = [
+    "deprecation>=2.0,<3.0",
    "enum34>=1.1,<2.0; python_version<'3.4'",
-    "future>=0.16,<0.17",
-    "numpy>=1.15,<1.16",
+    "funcsigs>=1.0,<2.0; python_version<'3.4'",
+    "future>=0.16,<0.18",
+    "num2words>=0.5.6,<0.6",
+    "numpy>=1.15,<2.0",
+    "pathlib>=1.0,<2.0; python_version<'3.4'",
+    "plac>=0.9.6,<2.0",
+    "pyaml>=17.0,<20.0",
+    "requests>=2.0,<3.0",
+    "scikit-learn>=0.20,<0.21; python_version<'3.5'",
+    "scikit-learn>=0.21.1,<0.22; python_version>='3.5'",
    "scipy>=1.0,<2.0",
-    "scikit-learn>=0.19,<0.20",
-    "sklearn-crfsuite>=0.3.6,<0.4",
    "semantic_version>=2.6,<3.0",
-    "snips-nlu-utils>=0.8,<0.9",
+    "sklearn-crfsuite>=0.3.6,<0.4",
    "snips-nlu-parsers>=0.2,<0.3",
-    "num2words>=0.5.6,<0.6",
-    "plac>=0.9.6,<1.0",
-    "requests>=2.0,<3.0",
-    "pathlib==1.0.1; python_version < '3.4'",
-    "pyaml>=17,<18",
-    "deprecation>=2,<3",
-    "funcsigs>=1.0,<2.0; python_version < '3.4'"
+    "snips-nlu-utils>=0.8,<0.9",
]

extras_require = {
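
The two `scikit-learn` entries are not a conflict: their PEP 508 environment
markers let pip select exactly one pin per interpreter. A small illustration
of how such markers evaluate, using the third-party `packaging` library
(an assumption for this sketch, not a dependency added by this PR):

    from packaging.markers import Marker

    # pip evaluates markers like these against the running interpreter,
    # so only one of the two scikit-learn requirements is ever installed
    old_python = Marker("python_version < '3.5'")
    new_python = Marker("python_version >= '3.5'")
    print(old_python.evaluate(), new_python.evaluate())  # False True on 3.7
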
2 changes: 1 addition & 1 deletion snips_nlu/__about__.py
@@ -13,7 +13,7 @@
__email__ = "[email protected], [email protected]"
__license__ = "Apache License, Version 2.0"

__version__ = "0.19.6"
__version__ = "0.19.7"
__model_version__ = "0.19.0"

__download_url__ = "https://github.com/snipsco/snips-nlu-language-resources/releases/download"
24 changes: 16 additions & 8 deletions snips_nlu/cli/generate_dataset.py
@@ -8,13 +8,21 @@

@plac.annotations(
    language=("Language of the assistant", "positional", None, str),
-    files=("List of intent and entity files", "positional", None, str, None,
-           "filename"))
-def generate_dataset(language, *files):
-    """Create a Snips NLU dataset from text friendly files"""
+    yaml_files=("List of intent and entity yaml files", "positional", None,
+                str, None, "filename"))
+def generate_dataset(language, *yaml_files):
+    """Creates a Snips NLU dataset from YAML definition files
+
+    Check :meth:`.Intent.from_yaml` and :meth:`.Entity.from_yaml` for the
+    format of the YAML files.
+
+    Args:
+        language (str): language of the dataset (iso code)
+        *yaml_files: list of intent and entity definition files in YAML format
+
+    Returns:
+        None. The json dataset output is printed out on stdout.
+    """
    language = unicode_string(language)
-    if any(f.endswith(".yml") or f.endswith(".yaml") for f in files):
-        dataset = Dataset.from_yaml_files(language, list(files))
-    else:
-        dataset = Dataset.from_files(language, list(files))
+    dataset = Dataset.from_yaml_files(language, list(yaml_files))
    print(json_string(dataset.json, indent=2, sort_keys=True))
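
For instance, the rewritten command can be exercised as follows; the file
name and YAML contents are illustrative, not part of this PR:

    from pathlib import Path
    from snips_nlu.dataset import Dataset

    # A minimal YAML intent definition, using the [slot:entity](value) syntax
    Path("turn_light_on.yaml").write_text(
        u"type: intent\n"
        u"name: turnLightOn\n"
        u"utterances:\n"
        u"  - Turn on the lights in the [room:room](kitchen)\n")

    # The same call the CLI now makes unconditionally for its input files
    dataset = Dataset.from_yaml_files("en", ["turn_light_on.yaml"])
    print(dataset.json)

The equivalent CLI invocation should be
`snips-nlu generate-dataset en turn_light_on.yaml`.
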
1 change: 1 addition & 0 deletions snips_nlu/constants.py
@@ -46,6 +46,7 @@
BUILTIN_ENTITY_PARSER = "builtin_entity_parser"
CUSTOM_ENTITY_PARSER = "custom_entity_parser"
MATCHING_STRICTNESS = "matching_strictness"
+RANDOM_STATE = "random_state"

# resources
RESOURCES = "resources"
6 changes: 3 additions & 3 deletions snips_nlu/data_augmentation.py
@@ -69,10 +69,10 @@ def get_entities_iterators(intent_entities, language,
                           add_builtin_entities_examples, random_state):
    entities_its = dict()
    for entity_name, entity in iteritems(intent_entities):
-        utterance_values = random_state.permutation(list(entity[UTTERANCES]))
+        utterance_values = random_state.permutation(sorted(entity[UTTERANCES]))
        if add_builtin_entities_examples and is_builtin_entity(entity_name):
-            entity_examples = get_builtin_entity_examples(entity_name,
-                                                          language)
+            entity_examples = get_builtin_entity_examples(
+                entity_name, language)
            # Builtin entity examples must be kept first in the iterator to
            # ensure that they are used when augmenting data
            iterator_values = entity_examples + list(utterance_values)
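
The `sorted()` in place of `list()` is what makes the seeded permutation
reproducible: iteration order over an unordered collection can change between
interpreter runs (with randomized string hashing, for instance), so the input
must be given a stable order before shuffling. A standalone sketch of the
idea, with made-up values:

    import numpy as np

    values = {"kitchen", "bedroom", "living room"}  # unordered collection

    random_state = np.random.RandomState(42)
    # sorted() fixes the input order, so the seeded shuffle is stable
    # across runs; permuting list(values) would not be
    print(random_state.permutation(sorted(values)))
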
6 changes: 4 additions & 2 deletions snips_nlu/dataset/intent.py
@@ -306,7 +306,8 @@ def capture_slot(state):
    next_colon_pos = state.find(':')
    next_square_bracket_pos = state.find(']')
    if next_square_bracket_pos < 0:
-        raise IntentFormatError("Missing ending ']' in annotated utterance")
+        raise IntentFormatError(
+            "Missing ending ']' in annotated utterance \"%s\"" % state.input)
    if next_colon_pos < 0 or next_square_bracket_pos < next_colon_pos:
        slot_name = state[:next_square_bracket_pos]
        state.move(next_square_bracket_pos)
@@ -327,7 +328,8 @@ def capture_tagged(state):
def capture_tagged(state):
    next_pos = state.find(')')
    if next_pos < 1:
-        raise IntentFormatError("Missing ending ')' in annotated utterance")
+        raise IntentFormatError(
+            "Missing ending ')' in annotated utterance \"%s\"" % state.input)
    else:
        tagged_text = state[:next_pos]
        state.add_tagged(tagged_text)
65 changes: 45 additions & 20 deletions snips_nlu/dataset/validation.py
@@ -8,6 +8,8 @@
from future.utils import iteritems, itervalues
from snips_nlu_parsers import get_all_languages

+from snips_nlu.common.dataset_utils import (validate_key, validate_keys,
+                                            validate_type)
from snips_nlu.constants import (
    AUTOMATICALLY_EXTENSIBLE, CAPITALIZE, DATA, ENTITIES, ENTITY, INTENTS,
    LANGUAGE, MATCHING_STRICTNESS, SLOT_NAME, SYNONYMS, TEXT, USE_SYNONYMS,
@@ -18,8 +20,9 @@
from snips_nlu.exceptions import DatasetFormatError
from snips_nlu.preprocessing import tokenize_light
from snips_nlu.string_variations import get_string_variations
-from snips_nlu.common.dataset_utils import validate_type, validate_key, \
-    validate_keys
+
+NUMBER_VARIATIONS_THRESHOLD = 1e3
+VARIATIONS_GENERATION_THRESHOLD = 1e4


def validate_and_format_dataset(dataset):
@@ -111,7 +114,7 @@ def _extract_entity_values(entity):
    return values


-def _validate_and_format_custom_entity(entity, queries_entities, language,
+def _validate_and_format_custom_entity(entity, utterance_entities, language,
                                        builtin_entity_parser):
    validate_type(entity, dict, object_label="entity")

@@ -146,30 +149,48 @@ def _validate_and_format_custom_entity(entity, utterance_entities, language,
        if not entry[VALUE]:
            continue
        validate_type(entry[SYNONYMS], list, object_label="entity synonyms")
-        entry[SYNONYMS] = [s.strip() for s in entry[SYNONYMS]
-                           if len(s.strip()) > 0]
+        entry[SYNONYMS] = [s.strip() for s in entry[SYNONYMS] if s.strip()]
        valid_entity_data.append(entry)
    entity[DATA] = valid_entity_data

    # Compute capitalization before normalizing: normalization lowercases
    # values and would hence skew the capitalization calculation
-    formatted_entity[CAPITALIZE] = _has_any_capitalization(queries_entities,
+    formatted_entity[CAPITALIZE] = _has_any_capitalization(utterance_entities,
                                                           language)

    validated_utterances = dict()
    # Map original values and synonyms
    for data in entity[DATA]:
        ent_value = data[VALUE]
-        if not ent_value:
-            continue
        validated_utterances[ent_value] = ent_value
        if use_synonyms:
            for s in data[SYNONYMS]:
-                if s and s not in validated_utterances:
+                if s not in validated_utterances:
                    validated_utterances[s] = ent_value

+    # Number variations in entity values are expensive, since each entity
+    # value is parsed with the builtin entity parser before the variations
+    # are created. We avoid generating these variations when there are
+    # already enough entity values.
+
    # Add variations if not colliding
    all_original_values = _extract_entity_values(entity)
+    if len(entity[DATA]) < VARIATIONS_GENERATION_THRESHOLD:
+        variations_args = {
+            "case": True,
+            "and_": True,
+            "punctuation": True
+        }
+    else:
+        variations_args = {
+            "case": False,
+            "and_": False,
+            "punctuation": False
+        }
+
+    variations_args["numbers"] = len(
+        entity[DATA]) < NUMBER_VARIATIONS_THRESHOLD
+
    variations = dict()
    for data in entity[DATA]:
        ent_value = data[VALUE]
@@ -178,10 +199,11 @@ def _validate_and_format_custom_entity(entity, queries_entities, language,
        values_to_variate.update(set(data[SYNONYMS]))
        variations[ent_value] = set(
            v for value in values_to_variate
-            for v in get_string_variations(value, language,
-                                           builtin_entity_parser))
+            for v in get_string_variations(
+                value, language, builtin_entity_parser, **variations_args)
+        )
    variation_counter = Counter(
-        [v for vars in itervalues(variations) for v in vars])
+        [v for variations_ in itervalues(variations) for v in variations_])
    non_colliding_variations = {
        value: [
            v for v in variations if
@@ -195,22 +217,25 @@ def _validate_and_format_custom_entity(entity, queries_entities, language,
        validated_utterances = _add_entity_variations(
            validated_utterances, non_colliding_variations, entry_value)

-    # Merge queries entities
-    queries_entities_variations = {
-        ent: get_string_variations(ent, language, builtin_entity_parser)
-        for ent in queries_entities
+    # Merge utterance entities
+    utterance_entities_variations = {
+        ent: get_string_variations(
+            ent, language, builtin_entity_parser, **variations_args)
+        for ent in utterance_entities
    }
-    for original_ent, variations in iteritems(queries_entities_variations):
+
+    for original_ent, variations in iteritems(utterance_entities_variations):
        if not original_ent or original_ent in validated_utterances:
            continue
        validated_utterances[original_ent] = original_ent
        for variation in variations:
-            if variation and variation not in validated_utterances:
+            if variation and variation not in validated_utterances \
+                    and variation not in utterance_entities:
                validated_utterances[variation] = original_ent
    formatted_entity[UTTERANCES] = validated_utterances
    return formatted_entity


-def _validate_and_format_builtin_entity(entity, queries_entities):
+def _validate_and_format_builtin_entity(entity, utterance_entities):
    validate_type(entity, dict, object_label="builtin entity")
-    return {UTTERANCES: set(queries_entities)}
+    return {UTTERANCES: set(utterance_entities)}
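
Condensed, the new gating logic reads as follows; `variations_args_for` is a
hypothetical helper name, and the constants and flags are the ones introduced
in this diff:

    NUMBER_VARIATIONS_THRESHOLD = 1e3
    VARIATIONS_GENERATION_THRESHOLD = 1e4

    def variations_args_for(num_entity_values):
        # Case, "and" and punctuation variations are cheap, so they are only
        # disabled for very large entities. Number variations run each value
        # through the builtin entity parser, hence the lower threshold.
        generate = num_entity_values < VARIATIONS_GENERATION_THRESHOLD
        return {
            "case": generate,
            "and_": generate,
            "punctuation": generate,
            "numbers": num_entity_values < NUMBER_VARIATIONS_THRESHOLD,
        }

    print(variations_args_for(500))    # all variations enabled
    print(variations_args_for(5000))   # number variations disabled
    print(variations_args_for(50000))  # all variations disabled
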
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_de.py
@@ -111,8 +111,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
-},
-"random_seed": None
+}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -140,8 +139,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
-},
-"random_seed": None
+}
}
}
]
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_en.py
@@ -97,8 +97,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
-},
-"random_seed": None
+}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -126,8 +125,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
-},
-"random_seed": None
+}
}
}
]
5 changes: 2 additions & 3 deletions snips_nlu/default_configs/config_es.py
@@ -90,7 +90,7 @@
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
},
"random_seed": None

},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -118,8 +117,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
-},
-"random_seed": None
+}
}
}
]
6 changes: 2 additions & 4 deletions snips_nlu/default_configs/config_fr.py
@@ -89,8 +89,7 @@
"min_utterances": 200,
"capitalization_ratio": 0.2,
"add_builtin_entities_examples": True
-},
-"random_seed": None
+}
},
"intent_classifier_config": {
"unit_name": "log_reg_intent_classifier",
@@ -118,8 +117,7 @@
"unknown_words_replacement_string": None,
"keep_order": True
}
-},
-"random_seed": None
+}
}
}
]