generated from OpenDocCN/doc-template
Commit 8d415dd (1 parent: 73e25a6), showing 128 changed files with 10,321 additions and 0 deletions.
The first rendered diff adds a single line, an empty JSON array:

[]
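Each data file rendered in this commit appears to hold a flat JSON array of strings, every string being one fenced Python snippet (the file above starts out empty; the one below carries a full snippet list). The following is a minimal sketch of how such a file could be consumed, assuming that layout; the path snippets.json and the output name snippets.md are hypothetical:

```py
import json

# Hypothetical file name; the files in this commit appear to share the same
# layout: a flat JSON array of strings, each holding one fenced code block.
with open("snippets.json", "r", encoding="utf-8") as f:
    snippets = json.load(f)

# Re-assemble the snippets into a single Markdown document (assumed usage).
with open("snippets.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(snippets))

print(f"wrote {len(snippets)} code blocks")
```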
The next rendered diff adds a JSON array of code snippets:
["```py\nwith open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n raw_text = f.read()\nprint(\"Total number of character:\", len(raw_text))\nprint(raw_text[:99])\n```", "```py\nTotal number of character: 20479\nI HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no \n```", "```py\nimport re\ntext = \"Hello, world. This, is a test.\"\nresult = re.split(r'(\\s)', text)\nprint(result)\n```", "```py\n['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']\n```", "```py\nresult = re.split(r'([,.]|\\s)', text)\nprint(result)\n```", "```py\n['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']\n```", "```py\nresult = [item for item in result if item.strip()]\nprint(result)\n```", "```py\n['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']\n```", "```py\ntext = \"Hello, world. Is this-- a test?\"\nresult = re.split(r'([,.:;?_!\"()\\']|--|\\s)', text)\nresult = [item.strip() for item in result if item.strip()]\nprint(result)\n```", "```py\n['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']\n```", "```py\npreprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', raw_text)\npreprocessed = [item.strip() for item in preprocessed if item.strip()]\nprint(len(preprocessed))\n```", "```py\nprint(preprocessed[:30])\n```", "```py\n['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']\n```", "```py\nall_words = sorted(list(set(preprocessed)))\nvocab_size = len(all_words)\nprint(vocab_size)\n```", "```py\nvocab = {token:integer for integer,token in enumerate(all_words)}\nfor i, item in enumerate(vocab.items()):\n print(item)\n if i > 50:\n break\n```", "```py\n('!', 0)\n('\"', 1)\n(\"'\", 2)\n...\n('Has', 49)\n('He', 50)\n```", "```py\nclass SimpleTokenizerV1:\n def __init__(self, vocab):\n self.str_to_int = vocab #A\n self.int_to_str = {i:s for s,i in vocab.items()} #B\n\n def encode(self, text): #C\n preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n preprocessed = [item.strip() for item in preprocessed if item.strip()]\n ids = [self.str_to_int[s] for s in preprocessed]\n return ids\n\n def decode(self, ids): #D\n text = \" \".join([self.int_to_str[i] for i in ids]) \n\n text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #E\n return text\n```", "```py\ntokenizer = SimpleTokenizerV1(vocab)\n\ntext = \"\"\"\"It's the last he painted, you know,\" Mrs. Gisburn said with pardonable pride.\"\"\"\nids = tokenizer.encode(text)\nprint(ids)\n```", "```py\n[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]\n```", "```py\nprint(tokenizer.decode(ids))\n```", "```py\n'\" It\\' s the last he painted, you know,\" Mrs. 
Gisburn said with pardonable pride.'\n```", "```py\ntext = \"Hello, do you like tea?\"\ntokenizer.encode(text)\n```", "```py\n...\nKeyError: 'Hello'\n```", "```py\nall_tokens = sorted(list(set(preprocessed)))\nall_tokens.extend([\"<|endoftext|>\", \"<|unk|>\"])\nvocab = {token:integer for integer,token in enumerate(all_tokens)}\n\nprint(len(vocab.items()))\n```", "```py\nfor i, item in enumerate(list(vocab.items())[-5:]):\n print(item)\n```", "```py\n('younger', 1156)\n('your', 1157)\n('yourself', 1158)\n('<|endoftext|>', 1159)\n('<|unk|>', 1160)\n```", "```py\nclass SimpleTokenizerV2:\n def __init__(self, vocab):\n self.str_to_int = vocab\n self.int_to_str = { i:s for s,i in vocab.items()}\n\n def encode(self, text):\n preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n preprocessed = [item.strip() for item in preprocessed if item.strip()]\n preprocessed = [item if item in self.str_to_int #A\n else \"<|unk|>\" for item in preprocessed]\n\n ids = [self.str_to_int[s] for s in preprocessed]\n return ids\n\n def decode(self, ids):\n text = \" \".join([self.int_to_str[i] for i in ids])\n\n text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text) #B\n return text\n```", "```py\ntext1 = \"Hello, do you like tea?\"\ntext2 = \"In the sunlit terraces of the palace.\"\ntext = \" <|endoftext|> \".join((text1, text2))\nprint(text)\n```", "```py\n'Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.'\n```", "```py\ntokenizer = SimpleTokenizerV2(vocab)\nprint(tokenizer.encode(text))\n```", "```py\n[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]\n```", "```py\nprint(tokenizer.decode(tokenizer.encode(text)))\n```", "```py\n'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'\n```", "```py\npip install tiktoken\n```", "```py\nfrom importlib.metadata import version\nimport tiktoken\nprint(\"tiktoken version:\", version(\"tiktoken\"))\n```", "```py\ntokenizer = tiktoken.get_encoding(\"gpt2\")\n```", "```py\ntext = \"Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.\"\nintegers = tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\nprint(integers)\n```", "```py\n[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]\n```", "```py\nstrings = tokenizer.decode(integers)\nprint(strings)\n```", "```py\n'Hello, do you like tea? 
<|endoftext|> In the sunlit terraces of someunknownPlace.'\n```", "```py\nwith open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n raw_text = f.read()\n\nenc_text = tokenizer.encode(raw_text)\nprint(len(enc_text))\n```", "```py\nenc_sample = enc_text[50:]\n```", "```py\ncontext_size = 4 #A\n\nx = enc_sample[:context_size]\ny = enc_sample[1:context_size+1]\nprint(f\"x: {x}\")\nprint(f\"y: {y}\")\n```", "```py\nx: [290, 4920, 2241, 287]\ny: [4920, 2241, 287, 257]\n```", "```py\nfor i in range(1, context_size+1):\n context = enc_sample[:i]\n desired = enc_sample[i]\n print(context, \"---->\", desired)\n```", "```py\n[290] ----> 4920\n[290, 4920] ----> 2241\n[290, 4920, 2241] ----> 287\n[290, 4920, 2241, 287] ----> 257\n```", "```py\nfor i in range(1, context_size+1):\n context = enc_sample[:i]\n desired = enc_sample[i]\n print(tokenizer.decode(context), \"---->\", tokenizer.decode([desired]))\n```", "```py\n and ----> established\n and established ----> himself\n and established himself ----> in\n and established himself in ----> a\n```", "```py\nimport torch\nfrom torch.utils.data import Dataset, DataLoader\n\nclass GPTDatasetV1(Dataset):\n def __init__(self, txt, tokenizer, max_length, stride):\n self.tokenizer = tokenizer\n self.input_ids = []\n self.target_ids = []\n\n token_ids = tokenizer.encode(txt) #A\n\n for i in range(0, len(token_ids) - max_length, stride): #B\n input_chunk = token_ids[i:i + max_length]\n target_chunk = token_ids[i + 1: i + max_length + 1]\n self.input_ids.append(torch.tensor(input_chunk))\n self.target_ids.append(torch.tensor(target_chunk))\n\n def __len__(self): #C\n return len(self.input_ids)\n\n def __getitem__(self, idx): #D\n return self.input_ids[idx], self.target_ids[idx]\n```", "```py\ndef create_dataloader_v1(txt, batch_size=4, \n max_length=256, stride=128, shuffle=True, drop_last=True):\n tokenizer = tiktoken.get_encoding(\"gpt2\") #A \n dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #B\n dataloader = DataLoader(\n dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last) #C\n return dataloader\n```", "```py\nwith open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n raw_text = f.read()\n\ndataloader = create_dataloader_v1(\n raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)\ndata_iter = iter(dataloader) #A\nfirst_batch = next(data_iter)\nprint(first_batch)\n```", "```py\n[tensor([[ 40, 367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]\n```", "```py\nsecond_batch = next(data_iter)\nprint(second_batch)\n```", "```py\n[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]\n```", "```py\ndataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)\n\ndata_iter = iter(dataloader)\ninputs, targets = next(data_iter)\nprint(\"Inputs:\\n\", inputs)\nprint(\"\\nTargets:\\n\", targets)\n```", "```py\nInputs:\n tensor([[ 40, 367, 2885, 1464],\n [ 1807, 3619, 402, 271],\n [10899, 2138, 257, 7026],\n [15632, 438, 2016, 257],\n [ 922, 5891, 1576, 438],\n [ 568, 340, 373, 645],\n [ 1049, 5975, 284, 502],\n [ 284, 3285, 326, 11]])\n\nTargets:\n tensor([[ 367, 2885, 1464, 1807],\n [ 3619, 402, 271, 10899],\n [ 2138, 257, 7026, 15632],\n [ 438, 2016, 257, 922],\n [ 5891, 1576, 438, 568],\n [ 340, 373, 645, 1049],\n [ 5975, 284, 502, 284],\n [ 3285, 326, 11, 287]])\n```", "```py\ninput_ids = torch.tensor([2, 3, 5, 1])\n```", "```py\nvocab_size = 6\noutput_dim = 3\n```", "```py\ntorch.manual_seed(123)\nembedding_layer = torch.nn.Embedding(vocab_size, 
output_dim)\nprint(embedding_layer.weight)\n```", "```py\nParameter containing:\ntensor([[ 0.3374, -0.1778, -0.1690],\n [ 0.9178, 1.5810, 1.3010],\n [ 1.2753, -0.2010, -0.1606],\n [-0.4015, 0.9666, -1.1481],\n [-1.1589, 0.3255, -0.6315],\n [-2.8400, -0.7849, -1.4096]], requires_grad=True)\n```", "```py\nprint(embedding_layer(torch.tensor([3])))\n```", "```py\ntensor([[-0.4015, 0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)\n```", "```py\nprint(embedding_layer(input_ids))\n```", "```py\ntensor([[ 1.2753, -0.2010, -0.1606],\n [-0.4015, 0.9666, -1.1481],\n [-2.8400, -0.7849, -1.4096],\n [ 0.9178, 1.5810, 1.3010]], grad_fn=<EmbeddingBackward0>)\n```", "```py\noutput_dim = 256\nvocab_size = 50257\ntoken_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n```", "```py\nmax_length = 4\ndataloader = create_dataloader_v1(\n raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)\ndata_iter = iter(dataloader)\ninputs, targets = next(data_iter)\nprint(\"Token IDs:\\n\", inputs)\nprint(\"\\nInputs shape:\\n\", inputs.shape)\n```", "```py\nToken IDs:\n tensor([[ 40, 367, 2885, 1464],\n [ 1807, 3619, 402, 271],\n [10899, 2138, 257, 7026],\n [15632, 438, 2016, 257],\n [ 922, 5891, 1576, 438],\n [ 568, 340, 373, 645],\n [ 1049, 5975, 284, 502],\n [ 284, 3285, 326, 11]])\n\nInputs shape:\n torch.Size([8, 4])\n```", "```py\ntoken_embeddings = token_embedding_layer(inputs)\nprint(token_embeddings.shape)\n```", "```py\ntorch.Size([8, 4, 256])\n```", "```py\ncontext_length = max_length\npos_embedding_layer = torch.nn.Embedding(context_lengthe, output_dim)\npos_embeddings = pos_embedding_layer(torch.arange(context_length))\nprint(pos_embeddings.shape)\n```", "```py\ntorch.Size([4, 256])\n```", "```py\ninput_embeddings = token_embeddings + pos_embeddings\nprint(input_embeddings.shape)\n```", "```py\ntorch.Size([8, 4, 256])\n```"] |
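Read in order, the snippets above go from raw text to batched token IDs to input embeddings. The following is a minimal end-to-end sketch assembled from those snippets; it assumes tiktoken and torch are installed, that the-verdict.txt is present, and that GPTDatasetV1 and create_dataloader_v1 from the snippet list are already defined in the session:

```py
import torch

# Reuses create_dataloader_v1 (and, through it, GPTDatasetV1 and the GPT-2
# BPE tokenizer from tiktoken) as defined in the snippets above.
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

vocab_size = 50257   # GPT-2 BPE vocabulary size
output_dim = 256     # embedding width used in the snippets
max_length = 4       # tokens per training example

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False)
inputs, targets = next(iter(dataloader))                        # [8, 4] token IDs

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)

token_embeddings = token_embedding_layer(inputs)                # [8, 4, 256]
pos_embeddings = pos_embedding_layer(torch.arange(max_length))  # [4, 256]
input_embeddings = token_embeddings + pos_embeddings            # [8, 4, 256]
print(input_embeddings.shape)                                   # torch.Size([8, 4, 256])
```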