Store message content as HTML #50
Store the messages as HTML so that all formatting is preserved. Telegram links in a message to other messages of the group or channel are replaced with site links.
Thanks. Will test this soon. |
I tried this and it works, but I think the html template file should be edited to reflect the changes, as the html elements are not rendered. The rss template though seems to be working just fine for now. |
Actually, never mind - I was using the old template for the HTML website. The new template works just fine. |
Are you sure you deleted the database and then synced again? Because otherwise you would be just applying the new code/template on the old raw text messages. |
I used an existing database, but that shouldn't break existing links on existing installations. Re-syncing large channels may be impractical.
|
I agree, starting over with large channels seems impractical. I could be missing something, but as I understand it, raw text should not be rendered without escaping, HTML cannot be escaped, and I don't think it is possible to differentiate between the two. I am unable to have a generalized |
IF we make such a setting, I believe it should be set to True by default in NEW configurations by adding it to the example config.yaml file. However, if the setting does not exist in the config.yaml file (i.e. started with an old version) it should assume it is False. |
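The default behaviour proposed here could be sketched roughly as follows. The key name `store_html` is hypothetical (the setting's actual name is not shown in this thread), and the config is assumed to be already parsed into a dictionary.

```python
# Hypothetical key name; the setting's real name is not given in this thread.
SETTING = "store_html"


def html_enabled(config: dict) -> bool:
    """Return True only if the setting is present and truthy.

    A config written before the setting existed has no such key,
    so it falls back to False (raw text), as proposed above.
    """
    return bool(config.get(SETTING, False))


# Illustrative configs: one from an older install, one from a new example file.
old_config = {"group": "example_group"}
new_config = {"group": "example_group", SETTING: True}

print(html_enabled(old_config))
print(html_enabled(new_config))
```

With this shape, existing installations keep their current behaviour without touching their config file, while new ones opt in through the example config.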
Yeah, an |
New sites will preserve message formatting by default. Fix hyperlinks not rendering on existing sites.
Thanks for the feedback! Have made the change. |
Well, that makes sense because the setting is meant to be constant; otherwise all projects should normally be set to use the HTML one. The real point of the setting is to avoid breaking previous setups by preventing a mix of text and HTML messages. If this behaviour is confusing, one solution would be to add a description next to the option explaining its nature and that it should not be changed after the first sync. Another solution would be to remove the option entirely and pull all new messages as HTML, regardless of the previous ones. Then, when generating the HTML/RSS templates, check each message to see if it is text or HTML, and handle it accordingly. I believe a good check would be to check the datatype, but really any method should suffice. Actually, now that I think about it, the second solution makes way more sense, but for some reason I didn't think of it before. |
Yep, the second option is better. The |
Thanks @farzat, but with the option you suggest we need to determine whether a message is raw text or HTML.
msg.raw_text -
msg.raw_text - I think we can't reliably differentiate (as existing raw text messages could be like example 1) without storing some information:
From these options, 1 seemed better to me because it is consistent. However, 2 is more flexible, especially for existing sites. @knadh If we want to also have the ability to switch after syncing the site and then build, we will have to store both raw text (content) and HTML (content_html), right? This could increase the database size. Handling this for existing sites adds some cases as well. |
<script src="main.js"/> I don't think this can be the case. If you type out HTML tags, Telegram encodes it. The above message will come as |
I guess Telegram returns |
Yes, as expected, |
- return _NL2BR.sub("\n\n", s).replace("\n", "\n<br />")
+ return _NL2BR.sub("\n\n", str(s)).replace("\n", "\n<br />")
Can't we use this? Whether the s variable is a string or not?
Yes, it can be removed now as urlize() does a cast already.
The re.sub() method expects a string and there are NULL/None messages which need to be cast to string. This was indirectly taken care of by the escape filter in the master branch. Since I changed the way filters work in the first commit, I had to make this cast.
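To illustrate the cast under discussion, here is a minimal standalone sketch of the filter. Only the `return` line comes from the diff; the definition of `_NL2BR` (a pattern matching runs of two or more newlines) is an assumption made so that the example runs on its own.

```python
import re

# Assumed definition: collapse any run of 2+ newlines into exactly two.
_NL2BR = re.compile(r"\n\n+")


def nl2br(s):
    # str() lets the filter accept None (NULL messages) without re.sub()
    # raising a TypeError. Note that str(None) is the literal text "None".
    return _NL2BR.sub("\n\n", str(s)).replace("\n", "\n<br />")


print(nl2br("a\n\n\nb"))
print(nl2br(None))
```

Without the cast, `nl2br(None)` raises `TypeError`, because `re.sub()` only accepts a string or bytes-like object. With the cast, a NULL message renders as the text `None`, which may or may not be the desired output for empty messages.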
Ok, I guess the way we could do this is by adding a new field in the database, for example one specifying which version of tg-archive was used to sync this message. In our case, the type or content of the field doesn't matter; its very existence indicates that the message was synced after this pull request, which means that the content is HTML, and otherwise plain text. This way, the option is also removed (all new messages are HTML), simplifying the config file. This of course stems from my philosophy of limiting the options exposed to the user. Raw text support was brought up here simply for the sake of backwards compatibility, so as long as we can keep backwards compatibility without adding the option, the option shouldn't exist. |
I think the example I chose of a version number is especially useful because it will also be helpful in similar cases in the future. Empty fields represent versions older than whatever the current version is. However, as I mentioned in the last comment, for this very particular case, any field with whatever random content should also do the job. |
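A rough sketch of how the version-field idea could work at render time is shown below. The function and parameter names (`render_content`, `synced_version`) are hypothetical and not taken from tg-archive; the only assumption about the database is that rows synced before this change have an empty (NULL) version field.

```python
import html
from typing import Optional


def render_content(content: Optional[str], synced_version: Optional[str]) -> str:
    """Return markup that is safe to place in a template unescaped."""
    if content is None:
        # NULL message body: nothing to render.
        return ""
    if synced_version is None:
        # No version recorded: the row predates HTML storage, so the
        # content is raw text and must be escaped.
        return html.escape(content)
    # Version recorded: the content was stored as HTML, pass it through.
    return content


print(render_content("<b>hi</b>", None))
print(render_content("<b>hi</b>", "1.0.0"))
```

The actual value of the version field is never inspected here, which matches the point that any non-empty field would do for this particular case.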
I agree. Some information has to be stored somewhere which is ultimately due to the Telethon If we store version number for future purposes other than this, it will require tracking which version introduced what feature/change. We have to check against a message's tg-archive version before doing something. I see how it can be used more generally, though. Can you @knadh please check and confirm the |
To summarize the options and their impact, we can have
|
The fourth option also has the advantage of keeping the raw text available in case we wanted to, for example, search the database. Personally I prefer the 3rd option the most (it's my idea after all), and then the 4th option. |
Sorry, this discussion has been idle for a few months now. But I came back to this a few days ago and thought of a different method: instead of the entire HTML message, storing only the message entities as JSON (e.g. |
Fixes #43
Messages are stored in the database as HTML. This preserves formatting such as bold, italic, underline, strikethrough, monospace and inline links.
Telegram links in a message to other messages (t.me/group/message_id) are replaced with their archival site version.
For example, t.me/example_group/12 becomes example_group/site/2022-02.html#12.
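The link replacement described above could be sketched along these lines. This is not the code from the pull request: `rewrite_links`, its parameters, and the `page_for_message` lookup (message ID to monthly page file) are all hypothetical names introduced for illustration.

```python
import re

# Matches t.me/<group>/<message_id>, with an optional http(s) scheme.
_TG_LINK = re.compile(
    r"(?:https?://)?t\.me/(?P<group>[A-Za-z0-9_]+)/(?P<msg_id>\d+)"
)


def rewrite_links(text, group, page_for_message, site_root):
    """Replace links to archived messages with their site equivalents."""

    def repl(match):
        if match.group("group") != group:
            # Link points to some other group or channel: leave it alone.
            return match.group(0)
        msg_id = int(match.group("msg_id"))
        page = page_for_message.get(msg_id)
        if page is None:
            # Message is not in the archive: keep the Telegram link.
            return match.group(0)
        return f"{site_root}/{page}#{msg_id}"

    return _TG_LINK.sub(repl, text)


pages = {12: "2022-02.html"}  # illustrative lookup, e.g. built from message dates
print(rewrite_links("see t.me/example_group/12", "example_group", pages,
                    "example_group/site"))
```

Links to other groups, or to messages that were never synced, are left untouched in this sketch so that they still resolve on Telegram.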