Add questions
xiaohanzhan-db committed Jan 8, 2024
1 parent be25591 commit 4651be7
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions scripts/data_prep/validate_and_tokenize_data.py
@@ -57,6 +57,8 @@
# MAGIC future: Literal[False] = False,
# MAGIC }
# MAGIC - Which null checks do we want?
# MAGIC - How do we map each model to its expected eos_text / bos_text format? [Ref](https://databricks.slack.com/archives/C05K29T9NBF/p1703644153357929?thread_ts=1703643155.904289&cid=C05K29T9NBF)
# MAGIC - How do we automate tokenization for CPT? It is always fairly standard: sequence -> concat(tok(BOS), tok(sequence), tok(EOS)), then concatenate the sequences. [Ref](https://databricks.slack.com/archives/C05K29T9NBF/p1703698056000399?thread_ts=1703643155.904289&cid=C05K29T9NBF)
# MAGIC ```

# COMMAND ----------
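
The CPT recipe in the note above (wrap each sequence as concat(tok(BOS), tok(sequence), tok(EOS)), then concatenate everything) could look roughly like the sketch below. It is not code from this repository: `toy_tokenize` and `tokenize_for_cpt` are hypothetical names, and the whitespace tokenizer is only a stand-in for whichever model tokenizer is actually used. The correct `bos_text` / `eos_text` values depend on the model, which is the open question raised in the same list.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace."""
    return text.split()


def tokenize_for_cpt(
    sequences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    bos_text: str = "",
    eos_text: str = "",
) -> List[str]:
    """Build one flat token stream from many raw sequences.

    Each sequence becomes tok(BOS) + tok(sequence) + tok(EOS); the
    wrapped sequences are then concatenated in input order.
    An empty bos_text or eos_text contributes no tokens.
    """
    # Tokenize the delimiters once, since they are identical for every sequence.
    bos_tokens = tokenize(bos_text) if bos_text else []
    eos_tokens = tokenize(eos_text) if eos_text else []

    stream: List[str] = []
    for sequence in sequences:
        stream.extend(bos_tokens)
        stream.extend(tokenize(sequence))
        stream.extend(eos_tokens)
    return stream
```

For example, `tokenize_for_cpt(["a b", "c"], toy_tokenize, bos_text="<s>", eos_text="</s>")` yields a single stream in which every document is bracketed by its own delimiters, so document boundaries stay recoverable after concatenation.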
