Parsing improvements for #31 #32

Open
wants to merge 32 commits into
base: master
cee262f
Handle numeric argument values without quotes
SL-Gundam Feb 3, 2018
d0da9e0
Adjust fixInlineElementSpacing to not trigger for emptyTags
SL-Gundam Feb 3, 2018
d64cd74
Allow to disable adding the CSS class after the tag
SL-Gundam Feb 4, 2018
d2eeedc
Adjusted test case for this commit d0da9e0fae21d160bef3e6c16dbfd5d0fe…
SL-Gundam Feb 4, 2018
6f40bcc
Fix URL difference on ending slash presence
SL-Gundam Feb 7, 2018
56f4424
Handle unquoted attribute values
SL-Gundam Feb 7, 2018
ea771c3
Escape all * and _ instead of just 1 or 2
SL-Gundam Feb 7, 2018
e024197
Cleanup redundant spaces
SL-Gundam Feb 9, 2018
9ab0dc9
One str_replace instead of three
SL-Gundam Feb 9, 2018
6c9d979
Adjust testcase for ea771c3774deb13ee362ce08a55b76166c600f80
SL-Gundam Feb 15, 2018
eea8837
Add back the final rtrim that was removed in e0241971af582ba3cd6ec0d8…
SL-Gundam Feb 15, 2018
55e5f54
Correct test case for e0241971af582ba3cd6ec0d8c6b82891ba9251b9
SL-Gundam Feb 15, 2018
84d75cc
Merge branch 'master' into Parsing_Improvements
SL-Gundam Feb 15, 2018
7a0f6a1
Change all html EOLs to line feeds
SL-Gundam Feb 17, 2018
e4a1f9b
flushLinebreaks added before handling text
SL-Gundam Feb 17, 2018
a72f108
Decreased the number of lineBreaks after blockelements
SL-Gundam Feb 17, 2018
75cf897
Added ltrim for html content after closing p tag
SL-Gundam Feb 17, 2018
552ba1b
Ignore Office namespace o:p tags
SL-Gundam Feb 26, 2018
0bea029
Add function getescapeInText
SL-Gundam Feb 27, 2018
475af00
Fix header markdown escaping
SL-Gundam Feb 27, 2018
dc77382
Add escaping for = markdown headers
SL-Gundam Feb 27, 2018
27131ac
Add proper amount of slashes for escape regex's
SL-Gundam Feb 27, 2018
f6a3290
Correction incase the last attribute is unquoted
SL-Gundam Mar 2, 2018
9efd59c
Replace &nbsp; with normal space
SL-Gundam Mar 2, 2018
fc34c27
Add more character to escapeInText
SL-Gundam Mar 3, 2018
2c329f1
Merge branch 'master' of github.com:Elephant418/Markdownify into Pars…
SL-Gundam Feb 1, 2019
e4f91ce
Merge branch 'master' of github.com:Elephant418/Markdownify into Pars…
SL-Gundam Feb 23, 2019
fd6763e
Allow numbers in xmlns attributes names
SL-Gundam Jul 12, 2019
b9b3f41
Fix empty table tag
SL-Gundam May 1, 2020
9623ae4
Correct indentation
SL-Gundam May 1, 2020
7db946f
Fix PHP8.3 support
tzi Feb 23, 2024
7c97008
Merge pull request #1 from Elephant418/fix-php8.3
SL-Gundam Jun 12, 2024
54 changes: 37 additions & 17 deletions src/Converter.php
@@ -133,6 +133,7 @@ class Converter
protected $ignore = [
'html',
'body',
'o:p',
];

/**
@@ -183,15 +184,15 @@ class Converter
* TODO: what's with block chars / sequences at the beginning of a block?
*/
protected $escapeInText = [
'\*\*([^*]+)\*\*' => '\*\*$1\*\*', // strong
'\*([^*]+)\*' => '\*$1\*', // em
'__(?! |_)(.+)(?!<_| )__' => '\_\_$1\_\_', // strong
'_(?! |_)(.+)(?!<_| )_' => '\_$1\_', // em
'\*' => '\\\\*', // *
'_' => '\\\\_', // _
'\|' => '\\\\|', // |
'([-*_])([ ]{0,2}\1){2,}' => '\\\\$0', // hr
'`' => '\`', // code
'\[(.+)\](\s*\()' => '\[$1\]$2', // links: [text] (url) => [text\] (url)
'\[(.+)\](\s*)\[(.*)\]' => '\[$1\]$2\[$3\]', // links: [text][id] => [text\][id\]
'^#(#{0,5}) ' => '\#$1 ', // header
'`' => '\\\\`', // code
'\[(.+)\](\s*\()' => '\\\\[$1\\\\]$2', // links: [text] (url) => [text\] (url)
'\[(.+)\](\s*)\[(.*)\]' => '\\\\[$1\\\\]$2\\\\[$3\\\\]', // links: [text][id] => [text\][id\]
'^#(#{0,5}) ' => '\\\\#$1 ', // header #
'^=(=*\h*)$' => '\\\\=$1', // header =
];

/**
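The `^=(=*\h*)$` entry added in the hunk above only works together with the `m` modifier that the constructor now appends, because `^` and `$` must match at line boundaries rather than only at the ends of the whole text. A minimal sketch, assuming a hypothetical standalone helper `escapeSetextUnderline()` that wraps the pattern the same way the constructor does:

```php
<?php
// Hypothetical helper, not part of Converter: escapes a Setext "=" underline.
function escapeSetextUnderline(string $text, bool $multiline = true): string
{
    // Same wrapping as in the constructor: "not preceded by a backslash", ungreedy (U),
    // and optionally multiline (m) so that ^ and $ match per line.
    $pattern = '@(?<!\\\\)^=(=*\h*)$@' . ($multiline ? 'mU' : 'U');

    // '\\\\' in the replacement yields one literal backslash in the output.
    return preg_replace($pattern, '\\\\=$1', $text);
}

$withM    = escapeSetextUnderline("Title\n=====");        // underline on line 2 is escaped
$withoutM = escapeSetextUnderline("Title\n=====", false); // ^ only matches at offset 0: unchanged
```

Without `m`, the underline on the second line is never seen by `^`, which is presumably why the modifier change and the new pattern arrive together.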
@@ -227,6 +228,13 @@ class Converter
*/
protected $indent = '';

/**
* previous indentation, when we want to disable current indentation and get it back later
*
* @var string
*/
static $previousIndent = '';

/**
* constructor, set options, setup parser
*
@@ -255,7 +263,7 @@ public function __construct($linkPosition = self::LINK_AFTER_CONTENT, $bodyWidth
$search = [];
$replace = [];
foreach ($this->escapeInText as $s => $r) {
array_push($search, '@(?<!\\\)' . $s . '@U');
array_push($search, '@(?<!\\\)' . $s . '@mU');
array_push($replace, $r);
}
$this->escapeInText = [
@@ -274,6 +282,7 @@ public function parseString($html)
{
$this->resetState();

$html = str_replace(array("\r\n", "\r"), "\n", $html);
$this->parser->html = $html;
$this->parse();

@@ -302,6 +311,16 @@ public function setKeepHTML($keepHTML)
$this->keepHTML = $keepHTML;
}

/**
* return escapeInText
*
* @return array escapeInText
*/
public function getescapeInText()

Nitpick: should this be getEscapeInText?

Contributor Author

Most of the properties use lower camelCase, with the first word not capitalized, for example https://github.com/Elephant418/Markdownify/blob/master/src/Converter.php#L169 and https://github.com/Elephant418/Markdownify/blob/master/src/Converter.php#L185

I kept the function name identical to the name of the property it returns. @tzi has not commented on this so far, but I expect he will go over the code once I've finished all of my "improvements" and added the associated test cases, as he requested.

So the short question is: should the function name match the property name exactly, or are there guidelines that determine the function name?
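
For what it is worth on the naming question: PSR-1 asks for camelCase method names, so `getEscapeInText()` would be the conventional getter for `$escapeInText`. PHP also resolves method names case-insensitively, so a rename should not break callers that use the other spelling. A minimal sketch, assuming a hypothetical stand-in class `NamingDemo` rather than the real `Converter`:

```php
<?php
// Hypothetical stand-in for the converter; only the getter matters here.
class NamingDemo
{
    protected $escapeInText = ['search' => [], 'replace' => []];

    // Conventional camelCase getter for the $escapeInText property.
    public function getEscapeInText()
    {
        return $this->escapeInText;
    }
}

$demo = new NamingDemo();

// PHP method names are case-insensitive, so both spellings reach the same method.
$viaCamelCase = $demo->getEscapeInText();
$viaOriginal  = $demo->getescapeInText();
```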

{
return $this->escapeInText;
}

/**
* iterate through the nodes and decide what we
* shall do with the current node
@@ -329,6 +348,7 @@ protected function parse()
// else drop
break;
case 'text':
$this->flushLinebreaks();
$this->handleText();
break;
case 'tag':
@@ -395,7 +415,8 @@ protected function parse()
}
}
// cleanup
$this->output = rtrim(str_replace('&amp;', '&', str_replace('&lt;', '<', str_replace('&gt;', '>', $this->output))));
$this->output = implode("\n", array_map('rtrim', explode("\n", $this->output)));
$this->output = rtrim(str_replace(['&amp;', '&lt;', '&gt;', '&nbsp;'], ['&', '<', '>', ' '], $this->output));
// end parsing, flush stacked tags
$this->flushFootnotes();
$this->stack = [];
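The cleanup change in the hunk above does two things: it trims trailing whitespace on every line, and it folds three nested `str_replace` calls into one call with arrays. The second part is order-sensitive, because `str_replace` applies array pairs in sequence to the running result. A minimal sketch with hypothetical helper names (none of them exist in `Converter`) that makes the difference visible:

```php
<?php
// Hypothetical standalone helpers; the names are illustrative, not part of Converter.

// Strip trailing whitespace from every line, not only from the end of the text.
function rtrimLines(string $output): string
{
    return implode("\n", array_map('rtrim', explode("\n", $output)));
}

// Nested calls, as before the change: "&amp;" is decoded last.
function decodeNested(string $output): string
{
    return str_replace('&amp;', '&', str_replace('&lt;', '<', str_replace('&gt;', '>', $output)));
}

// Single call with arrays, as after the change: pairs are applied in array order,
// so "&amp;" is decoded first and its result is visible to the later pairs.
function decodeSingleCall(string $output): string
{
    return str_replace(['&amp;', '&lt;', '&gt;', '&nbsp;'], ['&', '<', '>', ' '], $output);
}

$trimmed = rtrimLines("**Hey,**  \nHow are you? ");
$nested  = decodeNested('&amp;lt;b&amp;gt;');     // stays entity-encoded one level
$single  = decodeSingleCall('&amp;lt;b&amp;gt;'); // decoded all the way to tags
```

Whether the `keepHTML` path, which writes `&amp;lt;` so that cleanup reverses it, depends on the old order is an assumption worth verifying against the test suite; the sketch only shows that the two forms are not equivalent for doubly encoded input.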
@@ -507,7 +528,7 @@ protected function handleTagToText()
{
if (!$this->keepHTML) {
if (!$this->parser->isStartTag && $this->parser->isBlockElement) {
$this->setLineBreaks(2);
$this->setLineBreaks(1);
}
} else {
// don't convert to markdown inside this tag
@@ -534,8 +555,7 @@ protected function handleTagToText()
// don't indent inside <pre> tags
if ($this->parser->tagName == 'pre') {
$this->out($this->parser->node);
static $indent;
$indent = $this->indent;
$this->previousIndent = $this->indent;
$this->indent = '';
} else {
$this->out($this->parser->node . "\n" . $this->indent);
@@ -556,8 +576,7 @@ protected function handleTagToText()
} else {
// reset indentation
$this->out($this->parser->node);
static $indent;
$this->indent = $indent;
$this->indent = $this->previousIndent;
}

if (in_array($this->parent(), ['ins', 'del'])) {
@@ -579,7 +598,7 @@ protected function handleTagToText()
$this->buffer();
} else {
// add stuff so cleanup just reverses this
$this->out(str_replace('&lt;', '&amp;lt;', str_replace('&gt;', '&amp;gt;', $this->unbuffer())));
$this->out(str_replace(['&lt;', '&gt;'], ['&amp;lt;', '&amp;gt;'], $this->unbuffer()));
}
}
}
@@ -733,6 +752,7 @@ protected function handleTag_p()
{
if (!$this->parser->isStartTag) {
$this->setLineBreaks(2);
$this->parser->html = ltrim($this->parser->html);
}
}

@@ -786,7 +806,7 @@ protected function handleTag_a_converter($tag, $buffer)
return '[' . $buffer . ']()';
}

if ($buffer == $tag['href'] && empty($tag['title'])) {
if (rtrim($buffer, '/') == rtrim($tag['href'], '/') && empty($tag['title'])) {
// <http://example.com>
return '<' . $buffer . '>';
}
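
The trailing-slash comparison in the hunk above can be tried in isolation. A minimal sketch, assuming a hypothetical helper `isAutolink()` that mirrors only the condition and not the rest of `handleTag_a_converter()`:

```php
<?php
// Hypothetical helper mirroring the condition in handleTag_a_converter().
function isAutolink(string $buffer, string $href, string $title = ''): bool
{
    // Ignore a trailing slash on either side before comparing link text and target.
    return rtrim($buffer, '/') == rtrim($href, '/') && empty($title);
}

$sameUrl   = isAutolink('http://example.com', 'http://example.com/');
$otherText = isAutolink('example', 'http://example.com/');
$withTitle = isAutolink('http://example.com', 'http://example.com', 'Example');
```

Note that the converter still emits `$buffer` inside the angle brackets, so the link text, not the `href`, decides whether the slash appears in the output.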
36 changes: 28 additions & 8 deletions src/ConverterExtra.php
@@ -26,6 +26,13 @@ class ConverterExtra extends Converter
*/
protected $row = 0;

/**
* Add CSS class after the tag
*
* @var bool
*/
protected $addCssClass = true;

/**
* constructor, see Markdownify::Markdownify() for more information
*/
@@ -118,7 +125,7 @@ protected function handleHeader($level)
$this->stack();
} else {
$tag = $this->unstack();
if (!empty($tag['cssSelector'])) {
if (!empty($tag['cssSelector']) && $this->addCssClass) {
// {#id.class}
$this->out(' {' . $tag['cssSelector'] . '}');
}
@@ -148,7 +155,7 @@ protected function handleTag_a_parser()
protected function handleTag_a_converter($tag, $buffer)
{
$output = parent::handleTag_a_converter($tag, $buffer);
if (!empty($tag['cssSelector'])) {
if (!empty($tag['cssSelector']) && $this->addCssClass) {
// [This link][id]{#id.class}
$output .= '{' . $tag['cssSelector'] . '}';
}
@@ -295,13 +302,15 @@ protected function handleTag_table()
$rows = [];
// add padding
array_walk_recursive($this->table['rows'], [&$this, 'alignTdContent']);
$header = array_shift($this->table['rows']);
array_push($rows, '| ' . implode(' | ', $header) . ' |');
array_push($rows, $separator);
foreach ($this->table['rows'] as $row) {
array_push($rows, '| ' . implode(' | ', $row) . ' |');
if (!empty( $this->table['rows'])) {
$header = array_shift($this->table['rows']);
array_push($rows, '| ' . implode(' | ', $header) . ' |');
array_push($rows, $separator);
foreach ($this->table['rows'] as $row) {
array_push($rows, '| ' . implode(' | ', $row) . ' |');
}
$this->out(implode("\n" . $this->indent, $rows));
}
$this->out(implode("\n" . $this->indent, $rows));
$this->table = [];
$this->setLineBreaks(2);
}
@@ -568,4 +577,15 @@ protected function getCurrentCssSelector()
}
return $cssSelector;
}

/**
* set add CSS class after the tag
*
* @param bool $addCssClass
* @return void
*/
public function setAddCssClass($addCssClass)
{
$this->addCssClass = $addCssClass;
}
}
22 changes: 19 additions & 3 deletions src/Parser.php
@@ -7,6 +7,8 @@ class Parser
public static $skipWhitespace = true;
public static $a_ord;
public static $z_ord;
public static $n0_ord;
public static $n9_ord;
public static $special_ords;

/**
@@ -172,6 +174,7 @@ class Parser
'noframes' => true,
'noscript' => true,
'ol' => true,
'o:p' => true,
'p' => true,
'pre' => true,
'table' => true,
@@ -354,6 +357,8 @@ protected function parseTag()
if (!isset(static::$a_ord)) {
static::$a_ord = ord('a');
static::$z_ord = ord('z');
static::$n0_ord = ord('0');
static::$n9_ord = ord('9');
static::$special_ords = [
ord(':'), // for xml:lang
ord('-'), // for http-equiv
@@ -370,7 +375,7 @@ protected function parseTag()
// get tagName
while (isset($this->html[$pos])) {
$pos_ord = ord(strtolower($this->html[$pos]));
if (($pos_ord >= static::$a_ord && $pos_ord <= static::$z_ord) || (!empty($tagName) && is_numeric($this->html[$pos]))) {
if (($pos_ord >= static::$a_ord && $pos_ord <= static::$z_ord) || (!empty($tagName) && is_numeric($this->html[$pos])) || in_array($pos_ord, static::$special_ords)) {
$tagName .= $this->html[$pos];
$pos++;
} else {
@@ -410,13 +415,13 @@ protected function parseTag()
}

$pos_ord = ord(strtolower($this->html[$pos]));
if (($pos_ord >= static::$a_ord && $pos_ord <= static::$z_ord) || in_array($pos_ord, static::$special_ords)) {
if (($pos_ord >= static::$a_ord && $pos_ord <= static::$z_ord) || in_array($pos_ord, static::$special_ords) || (substr($currAttrib, 0, 5) === 'xmlns' && $pos_ord >= static::$n0_ord && $pos_ord <= static::$n9_ord)) {
// attribute name
$currAttrib .= $this->html[$pos];
} elseif (in_array($this->html[$pos], [' ', "\t", "\n"])) {
// drop whitespace
} elseif (in_array($this->html[$pos] . $this->html[$pos + 1], ['="', "='"])) {
// get attribute value
// get quoted attribute value
$pos++;
$await = $this->html[$pos]; // single or double quote
$pos++;
@@ -427,6 +432,17 @@ protected function parseTag()
}
$attributes[$currAttrib] = $value;
$currAttrib = '';
} elseif ($this->html[$pos] === '=') {
// get unquoted attribute value
$pos++;
$value = '';
while (isset($this->html[$pos]) && !in_array($this->html[$pos], array(' ', '/', '>'), true)) {
$value .= $this->html[$pos];
$pos++;
}
$pos--;
$attributes[$currAttrib] = $value;
$currAttrib = '';
} else {
$this->invalidTag();
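
The unquoted-value branch added in the hunk above can be exercised on its own. A minimal sketch, assuming a hypothetical function `readUnquotedValue()` that takes the HTML and the offset of the `=` sign and returns the value plus the offset the parser would resume from:

```php
<?php
// Hypothetical standalone scanner; $pos must point at the "=" sign.
function readUnquotedValue(string $html, int $pos): array
{
    $pos++; // skip "="
    $value = '';
    // Collect characters until a space, a slash, or the end of the tag.
    while (isset($html[$pos]) && !in_array($html[$pos], [' ', '/', '>'], true)) {
        $value .= $html[$pos];
        $pos++;
    }

    // Step back one character so the caller's loop sees the terminator next.
    return [$value, $pos - 1];
}

$numeric = readUnquotedValue('<td colspan=2>', 11);
$url     = readUnquotedValue('<a href=http://x>', 7);
```

Because `/` is one of the terminators, an unquoted URL is cut at the first slash (`http:` in the second call). Tabs and newlines are not terminators in this loop. Both observations follow from the loop as written and may or may not matter for the inputs this PR targets.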

12 changes: 6 additions & 6 deletions test/ConverterTestCase.php
@@ -166,7 +166,7 @@ public function providerBlockquoteConversion()
$data['simple']['md'] = '> blockquoted text goes here';
$data['paragraphs']['html'] = '<blockquote><p>paragraph1</p><p>paragraph2</p></blockquote>';
$data['paragraphs']['md'] = '> paragraph1' . PHP_EOL
. '> ' . PHP_EOL
. '>' . PHP_EOL
. '> paragraph2';
$data['cascade']['html'] = '<blockquote><blockquote>cascading blockquote</blockquote></blockquote>';
$data['cascade']['md'] = '> > cascading blockquote';
@@ -209,7 +209,7 @@ public function providerListConversion()
. ' 2. Magic';
$data['next-to-text-in-block-context']['html'] = '<blockquote>McHale<ol><li>Bird</li><li>Magic</li></ol></blockquote>';
$data['next-to-text-in-block-context']['md'] = '> McHale' . PHP_EOL
. '> ' . PHP_EOL
. '>' . PHP_EOL
. '> 1. Bird' . PHP_EOL
. '> 2. Magic';
$data['next-to-bold']['html'] = '<b>McHale</b><ol><li>Bird</li><li>Magic</li></ol>';
@@ -218,7 +218,7 @@ public function providerListConversion()
. ' 1. Bird' . PHP_EOL
. ' 2. Magic';
$data['next-to-bold-and-br']['html'] = '<b>McHale</b><br><ol><li>Bird</li><li>Magic</li></ol>';
$data['next-to-bold-and-br']['md'] = '**McHale** ' . PHP_EOL
$data['next-to-bold-and-br']['md'] = '**McHale**' . PHP_EOL
. PHP_EOL
. PHP_EOL
. ' 1. Bird' . PHP_EOL
@@ -482,7 +482,7 @@ public function providerRulesConversion()
$data['escape-']['html'] = '-----------------------------------';
$data['escape-']['md'] = '\---\---\---\---\---\---\---\---\---\---\-----';
$data['escape-']['html'] = '*****************';
$data['escape-']['md'] = '\***\***\***\***\*****';
$data['escape-']['md'] = '\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*';

return $data;
}
@@ -504,9 +504,9 @@ public function providerFixBreaks()
{
$data = [];
$data['break1']['html'] = "<strong>Hello,<br>How are you doing?</strong>";
$data['break1']['md'] = "**Hello, \nHow are you doing?**";
$data['break1']['md'] = "**Hello,\nHow are you doing?**";
$data['break2']['html'] = "<b>Hey,<br> How you're doing?</b><br><br><b>Sorry<br><br> You can't get through</b>";
$data['break2']['md'] = "**Hey, \nHow you're doing?** \n \n**Sorry \n \nYou can't get through**";
$data['break2']['md'] = "**Hey,\nHow you're doing?**\n\n**Sorry\n\nYou can't get through**";

return $data;
}