Characters to HTML


Axel Berger
 

I don't know if this has been mentioned here before, but I only just found
out that the menu function

Characters to HTML --> Extended Characters

which I often use as one step in clips, is broken. None of the characters
that are part of cp-1252 but not of ISO-8859-1 (positions 128 to 159) are
converted. I have just added the following to my clip:

^!Replace "€" >> "&euro;" HASTI
^!Replace "‚" >> "&sbquo;" HASTI
^!Replace "ƒ" >> "&fnof;" HASTI
^!Replace "„" >> "&bdquo;" HASTI
^!Replace "…" >> "&hellip;" HASTI
^!Replace "†" >> "&dagger;" HASTI
^!Replace "‡" >> "&Dagger;" HASTI
^!Replace "ˆ" >> "&circ;" HASTI
^!Replace "‰" >> "&permil;" HASTI
^!Replace "Š" >> "&Scaron;" HASTI
^!Replace "‹" >> "&lsaquo;" HASTI
^!Replace "Œ" >> "&OElig;" HASTI
^!Replace "Ž" >> "Z" HASTI
^!Replace "‘" >> "&lsquo;" HASTI
^!Replace "’" >> "&rsquo;" HASTI
^!Replace "“" >> "&ldquo;" HASTI
^!Replace "”" >> "&rdquo;" HASTI
^!Replace "•" >> "&bull;" HASTI
^!Replace "–" >> "&ndash;" HASTI
^!Replace "—" >> "&mdash;" HASTI
^!Replace "˜" >> "&tilde;" HASTI
^!Replace "™" >> "&trade;" HASTI
^!Replace "š" >> "&scaron;" HASTI
^!Replace "›" >> "&rsaquo;" HASTI
^!Replace "œ" >> "&oelig;" HASTI
^!Replace "ž" >> "z" HASTI
^!Replace "Ÿ" >> "&Yuml;" HASTI


--
/¯\ No | Dipl.-Ing. F. Axel Berger Tel: +49/ 221/ 7771 8067
\ / HTML | Roald-Amundsen-Straße 2a Fax: +49/ 221/ 7771 8069
 X in | D-50829 Köln-Ossendorf http://berger-odenthal.de
/ \ Mail | -- No unannounced, large, binary attachments, please! --


Marcelo Bastos
 

On 02/09/2020 18:34, Axel Berger wrote:
> I don't know if this has been mentioned here before, but I only just found
> out that the menu function
>
> Characters to HTML --> Extended Characters
>
> which I often use as one step in clips, is broken. None of the characters
> that are part of cp-1252 but not of ISO-8859-1 (positions 128 to 159) are
> converted. I have just added the following to my clip:
> ^!Replace "€" >> "&euro;" HASTI
> ^!Replace "‚" >> "&sbquo;" HASTI
> ^!Replace "ƒ" >> "&fnof;" HASTI
> ^!Replace "„" >> "&bdquo;" HASTI
> ^!Replace "…" >> "&hellip;" HASTI
> ^!Replace "†" >> "&dagger;" HASTI
> ^!Replace "‡" >> "&Dagger;" HASTI
> ^!Replace "ˆ" >> "&circ;" HASTI
> ^!Replace "‰" >> "&permil;" HASTI
> ^!Replace "Š" >> "&Scaron;" HASTI
> ^!Replace "‹" >> "&lsaquo;" HASTI
> ^!Replace "Œ" >> "&OElig;" HASTI
> ^!Replace "&#65533;" >> "Z" HASTI
> ^!Replace "‘" >> "&lsquo;" HASTI
> ^!Replace "’" >> "&rsquo;" HASTI
> ^!Replace "“" >> "&ldquo;" HASTI
> ^!Replace "”" >> "&rdquo;" HASTI
> ^!Replace "•" >> "&bull;" HASTI
> ^!Replace "–" >> "&ndash;" HASTI
> ^!Replace "—" >> "&mdash;" HASTI
> ^!Replace "˜" >> "&tilde;" HASTI
> ^!Replace "™" >> "&trade;" HASTI
> ^!Replace "š" >> "&scaron;" HASTI
> ^!Replace "›" >> "&rsaquo;" HASTI
> ^!Replace "œ" >> "&oelig;" HASTI
> ^!Replace "ž" >> "z" HASTI
> ^!Replace "Ÿ" >> "&Yuml;" HASTI
That's... _mostly_ OK, if you want to take a Win-1252-encoded page and
convert it to ISO-8859-1-encoded HTML. However, two of your lines are not
quite right -- the ones dealing with the "Z with a caron" character
(both cases).

First, replacing them with a plain "z" is suboptimal; if you happen to
be dealing with a language where z-with-a-caron is a real character,
like a few of the Balkan languages, you would end up with incorrect
spelling. Better choices would be &Zcaron; / &zcaron;, but those
appear to have been standardized only in HTML5 and are not supported in
earlier versions of HTML. So the best option is to use the appropriate
Unicode numeric entity, that is, either &#381; / &#382; (in decimal) or
&#x17D; / &#x17E; (hexadecimal).

Second, "&#65533;" is *already* a valid HTML numeric character entity
that should work in any encoding, so it doesn't require conversion.
Also, it's *not* the code for "capital-Z-with-a-caron" (which is
Win-1252 character 142); rather, it is the code for the "replacement
character" (U+FFFD), that diamond shape with a question mark inside
which shows up when text cannot be decoded into a valid character or
your computer has no way to display a particular Unicode code point.

So, I would suggest the following lines instead:

^!Replace "Ž" >> "&#381;" HAST
^!Replace "ž" >> "&#382;" HAST
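
As a cross-check on the code numbers quoted above, here is a minimal Python sketch. Python is used purely for illustration and is not part of NoteTab, and the helper name `numeric_refs` is made up for this example. It decodes the two Win-1252 byte values for Z-with-caron and prints the matching decimal and hexadecimal references.

```python
def numeric_refs(byte_value):
    """Return (character, decimal reference, hex reference) for one cp1252 byte."""
    char = bytes([byte_value]).decode("cp1252")
    code_point = ord(char)
    return char, f"&#{code_point};", f"&#x{code_point:X};"


# 142 and 158 are the Win-1252 positions of capital and small Z with caron.
for byte_value in (142, 158):
    print(byte_value, *numeric_refs(byte_value))
```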

I have also run into HTML pages (mostly old, poorly maintained ones) with invalid numeric references pointing to Win-1252 values in the 128-159 range; I find it useful to replace them with named character entities or the correct numeric entities, like so:

^!Replace "&#128;" >> "&euro;" TWSA
^!Replace "&#130;" >> "&sbquo;" TWSA
^!Replace "&#131;" >> "&fnof;" TWSA
^!Replace "&#132;" >> "&bdquo;" TWSA
^!Replace "&#133;" >> "&hellip;" TWSA
^!Replace "&#134;" >> "&dagger;" TWSA
^!Replace "&#135;" >> "&Dagger;" TWSA
^!Replace "&#136;" >> "&circ;" TWSA
^!Replace "&#137;" >> "&permil;" TWSA
^!Replace "&#138;" >> "&Scaron;" TWSA
^!Replace "&#139;" >> "&lsaquo;" TWSA
^!Replace "&#140;" >> "&OElig;" TWSA
^!Replace "&#142;" >> "&#381;" TWSA
^!Replace "&#145;" >> "&lsquo;" TWSA
^!Replace "&#146;" >> "&rsquo;" TWSA
^!Replace "&#147;" >> "&ldquo;" TWSA
^!Replace "&#148;" >> "&rdquo;" TWSA
^!Replace "&#149;" >> "&bull;" TWSA
^!Replace "&#150;" >> "&ndash;" TWSA
^!Replace "&#151;" >> "&mdash;" TWSA
^!Replace "&#152;" >> "&tilde;" TWSA
^!Replace "&#153;" >> "&trade;" TWSA
^!Replace "&#154;" >> "&scaron;" TWSA
^!Replace "&#155;" >> "&rsaquo;" TWSA
^!Replace "&#156;" >> "&oelig;" TWSA
^!Replace "&#158;" >> "&#382;" TWSA
^!Replace "&#159;" >> "&Yuml;" TWSA
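
A table like the one above need not be typed by hand. The following is a rough Python sketch, not a NoteTab feature, with a made-up helper name `build_table`. It assumes that the invalid references are written in decimal and that Python's `html.entities.codepoint2name` (which covers the HTML 4.01 names only) is an acceptable source for the named entities. Where no HTML 4 name exists, as for Z-with-caron, it falls back to the correct numeric reference.

```python
from html.entities import codepoint2name  # HTML 4.01 entity names only


def build_table():
    """Map each invalid reference &#128;..&#159; to a valid replacement."""
    table = {}
    for byte_value in range(128, 160):
        try:
            char = bytes([byte_value]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # 129, 141, 143, 144 and 157 are unassigned in cp1252
        code_point = ord(char)
        name = codepoint2name.get(code_point)
        # Prefer the HTML 4 name; otherwise use the numeric reference.
        table[f"&#{byte_value};"] = f"&{name};" if name else f"&#{code_point};"
    return table


# Print the pairs in clip syntax, reusing the flags from the list above.
for bad, good in build_table().items():
    print(f'^!Replace "{bad}" >> "{good}" TWSA')
```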

I wouldn't use the "I" flag for this kind of replacement, since we are going for specific character replacements, not "letter" replacements, and some of the replaced characters are technically uppercase/lowercase pairs (Š/š, Œ/œ, Ž/ž), which the "I" flag would treat as the same (at least in theory; it may be that the regexp engine in NoteTab does not include an upper/lowercase table for Eastern European alphabets). I also prefer "W" to "H" in this case because, in my use cases, those characters/entities will cause trouble *wherever* they are in the document, so I don't want to leave any stragglers behind if I happen to have selected only part of the document.
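
To make the concern about case-insensitive matching concrete, here is a small sketch using Python's `re` module. It says nothing about how NoteTab's own engine behaves, which is exactly the open question above. The sample text is invented.

```python
import re

text = "Škoda and škoda"

# Case-sensitive: only the capital letter is replaced.
sensitive = re.sub("Š", "&Scaron;", text)

# Case-insensitive: the small letter is (wrongly) given the capital's entity.
insensitive = re.sub("Š", "&Scaron;", text, flags=re.IGNORECASE)

print(sensitive)
print(insensitive)
```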

--
MCBastos

This message has been protected with the 2ROT13 algorithm. Unauthorized use will be prosecuted under the DMCA.




Axel Berger
 

Marcelo Bastos wrote:
> if you want to take a Win-1252-encoded page and
> convert it to ISO-8859-1-encoded HTML.
Actually all my own pages are US-ASCII, the most basic variant. All those
characters, if I encounter them at all, come from copy/pasting or saving
text written by others.

> However, two of your lines are not
> quite right -- the ones dealing with the "Z with a caron" character
> (both cases).
Yes, I was quite aware of that. The basic unaccented letter was the best I
could come up with and better than nothing.

> but those appear to have been standardized only in HTML5
Exactly. I write and specify HTML 4.01 Transitional.

> numeric entity, that is, either &#381; / &#382; (in decimal)
But can I rely on those being supported everywhere, and are they part of the
HTML 4 standard? I was not sure about that and preferred to err on the safe
side.

> Second, "&#65533;" is *already* a valid HTML numeric character entity
It is, but it's not what I wrote. You must be aware that this list often
munges posts. I had expected that but assumed that, from my logically
ascending list, readers would be able to restore the original where broken.
By the way, the rest of my post looks all right, but what you quoted back at
me has been converted to UTF-8. Copied from your post, not a single one of
those replacements will work.

> I have also run into HTML pages (mostly old, poorly maintained ones)
> with invalid numeric references pointing to Win-1252 values in the
> 128-159 range;
I find that even in many new and current pages. I have another clip for
many (most) things in the range from &nbsp; to &yuml;. In my case I convert
everything to cp-1252, as not everything (in fact few things) I saved as or
from HTML is meant to be used as such. A sample line reads

^!Replace "(&\#128;|&euro;|&\#8364;)" >> "€" HRAST

That clip is one of the things in my clip bar.
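
For comparison, the same normalisation can be sketched outside NoteTab. This Python snippet is an illustration only, not the clip described above. It relies on `html.unescape` from the standard library, which follows the HTML5 rules and therefore also maps the invalid reference `&#128;` to the euro sign. The variable names are made up.

```python
import html

# The three spellings covered by the sample regular expression.
variants = ["&#128;", "&euro;", "&#8364;"]

decoded = [html.unescape(v) for v in variants]
print(decoded)

# Each result is a single character that cp-1252 stores as byte 0x80.
as_cp1252 = [d.encode("cp1252") for d in decoded]
print(as_cp1252)
```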

> I wouldn't use the "I" flag for this kind of replacement,
Absolutely right! I should have spotted that one myself. Thanks.




Marcelo Bastos
 

On 02/09/2020 22:04, Axel Berger wrote:

>> numeric entity, that is, either &#381; / &#382; (in decimal)
> But can I rely on those being supported everywhere and are they part of the
> HTML 4 standard? I was not sure about that and preferred to fall back to
> the safe side.
Yes, they are. Numeric entities are enshrined in the HTML 4.01 standard,
but they actually come from way back in SGML.

https://www.w3.org/TR/html4/charset.html#entities

The HTML 3.2 standard, in fact, *uses* numeric entities to *define* the
named entities:

https://www.w3.org/TR/2018/SPSD-html32-20180315/#latin1

TL;DR: Numeric entities ALWAYS work. They probably work even in HTML 0.9.
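
For pages kept in plain US-ASCII, numeric references are also the easiest thing to generate mechanically. A minimal Python sketch follows (again only an illustration outside NoteTab, with a made-up helper name `to_ascii_html`). It uses the standard `xmlcharrefreplace` error handler, which emits decimal references.

```python
def to_ascii_html(text):
    """Replace every non-ASCII character with a decimal numeric reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("Žižek costs 5 €"))
```

Note that this leaves &, < and > untouched; it only deals with characters outside ASCII.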

