Detect character set for files, streams and other bytes.
Detection of character sets with a simple and redesigned interface.
This package is based on Ude and since version 2 also on uchardet, which are ports of the Mozilla Universal Charset Detector.
The interface and other classes have been redesigned so that the library is easier to use and follows a better object-oriented design (OOD). Unit tests and CI have been added.
Features:
Remarks: You can still register your own EncodingProvider so that Encoding.GetEncoding(...) looks in it first.
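As a minimal sketch of the remark above, the snippet below registers a provider and then resolves a name through Encoding.GetEncoding. It assumes the provider you want is CodePagesEncodingProvider from System.Text.Encoding.CodePages (a separate package on older targets); the class and method names are hypothetical and not part of this library.

```csharp
using System.Text;

public static class EncodingProviderSetup
{
    // Hypothetical helper: registers the code-pages provider, then resolves
    // the given name. Encoding.GetEncoding consults registered providers.
    public static Encoding GetEncodingWithCodePages(string name)
    {
        // Registering the same provider instance again has no further effect.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(name);
    }
}
```

In a real application the registration would normally happen once at startup, before any detection result is turned into a System.Text.Encoding.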
Use the static DetectFrom... methods of CharsetDetector (DetectFromFile, DetectFromStream, DetectFromBytes).
// Detect from File (.NET Standard 1.3+ or .NET 4+)
DetectionResult result = CharsetDetector.DetectFromFile("path/to/file.txt"); // or pass FileInfo

// Detect from Stream (.NET Standard 1.3+ or .NET 4+)
result = CharsetDetector.DetectFromStream(stream);

// Detect from bytes
result = CharsetDetector.DetectFromBytes(byteArray);

// Get the best Detection
DetectionDetail resultDetected = result.Detected;

// Get the alias of the found encoding
string encodingName = resultDetected.EncodingName;

// Get the System.Text.Encoding of the found encoding (can be null if not available)
Encoding encoding = resultDetected.Encoding;

// Get the confidence of the found encoding (between 0 and 1)
float confidence = resultDetected.Confidence;

// Get all the details of the result
IList<DetectionDetail> allDetails = result.Details;
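Because DetectionDetail.Encoding can be null and Confidence ranges from 0 to 1, callers usually need a fallback rule. The sketch below shows one possible rule; the helper name, the threshold parameter, and the idea of a caller-supplied default are assumptions for illustration and are not part of the library.

```csharp
using System;
using System.Text;

public static class DetectionFallback
{
    // Hypothetical helper: use the detected encoding only when it is available
    // and its confidence reaches minConfidence; otherwise use the fallback.
    public static Encoding Choose(Encoding detected, float confidence,
                                  float minConfidence, Encoding fallback)
    {
        if (fallback == null) throw new ArgumentNullException(nameof(fallback));
        return detected != null && confidence >= minConfidence ? detected : fallback;
    }
}
```

With the variables from the usage example, a call could look like DetectionFallback.Choose(resultDetected.Encoding, resultDetected.Confidence, 0.5f, Encoding.UTF8). The 0.5 threshold is arbitrary and should be tuned to your data.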
The article "A composite approach to language/encoding detection" describes the charset detection algorithms implemented by the library.
The following charsets are supported.
Encodings with BOM: utf-7, utf-8, utf-16be/utf-16le, utf-32be/utf-32le, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431, gb18030.
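To illustrate what detection by BOM means, here is a simplified sketch that matches well-known byte order marks at the start of a buffer. It is not the library's implementation: the class and method names are hypothetical, and the two unusual UCS-4 byte orders are left out for brevity.

```csharp
using System;

public static class BomSniffer
{
    // Returns the charset implied by a leading BOM, or null when none matches.
    public static string DetectBom(byte[] b)
    {
        if (b == null) throw new ArgumentNullException(nameof(b));

        // Four-byte marks are checked first so that utf-32le (FF FE 00 00)
        // is not reported as utf-16le (FF FE).
        if (StartsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "utf-32be";
        if (StartsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "utf-32le";
        if (StartsWith(b, 0x84, 0x31, 0x95, 0x33)) return "gb18030";
        if (StartsWith(b, 0xEF, 0xBB, 0xBF)) return "utf-8";

        // utf-7: "+/v" followed by one of '8', '9', '+', '/'.
        if (b.Length >= 4 && StartsWith(b, 0x2B, 0x2F, 0x76)
            && (b[3] == 0x38 || b[3] == 0x39 || b[3] == 0x2B || b[3] == 0x2F))
            return "utf-7";

        if (StartsWith(b, 0xFE, 0xFF)) return "utf-16be";
        if (StartsWith(b, 0xFF, 0xFE)) return "utf-16le";
        return null;
    }

    // True when data begins with exactly the given prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

When no BOM is present the sketch returns null, which is the case where statistical detection of the encodings without BOM takes over.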
Encodings without BOM are listed in the table below, grouped by language:
Language | Encodings
International (Unicode) | utf-8
Arabic | iso-8859-6, windows-1256
Bulgarian | iso-8859-5, windows-1251
Chinese | iso-2022-cn, big5, euc-tw, gb18030, hz-gb-2312
Croatian | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce
Czech | windows-1250, iso-8859-2, ibm852, x-mac-ce
Danish | iso-8859-1, iso-8859-15, windows-1252
English | ascii
Esperanto | iso-8859-3
Estonian | iso-8859-4, iso-8859-13, windows-1252, windows-1257
Finnish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-13, iso-8859-15, windows-1252
French | iso-8859-1, iso-8859-15, windows-1252
German | iso-8859-1, windows-1252
Greek | iso-8859-7, windows-1253
Hebrew | iso-8859-8, windows-1255
Hungarian | iso-8859-2, windows-1250
Irish Gaelic | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252
Italian | iso-8859-1, iso-8859-3, iso-8859-9, iso-8859-15, windows-1252
Japanese | iso-2022-jp, shift-jis, euc-jp
Korean | iso-2022-kr, euc-kr/uhc, cp949
Lithuanian | iso-8859-4, iso-8859-10, iso-8859-13
Latvian | iso-8859-4, iso-8859-10, iso-8859-13
Maltese | iso-8859-3
Polish | iso-8859-2, iso-8859-13, iso-8859-16, windows-1250, ibm852, x-mac-ce
Portuguese | iso-8859-1, iso-8859-9, iso-8859-15, windows-1252
Romanian | iso-8859-2, iso-8859-16, windows-1250, ibm852
Russian | iso-8859-5, koi8-r, windows-1251, x-mac-cyrillic, ibm855, ibm866
Slovak | windows-1250, iso-8859-2, ibm852, x-mac-ce
Slovene | iso-8859-2, iso-8859-16, windows-1250, ibm852, x-mac-ce
Spanish | iso-8859-1, iso-8859-15, windows-1252
Swedish | iso-8859-1, iso-8859-4, iso-8859-9, iso-8859-15, windows-1252
Thai | tis-620, iso-8859-11
Turkish | iso-8859-3, iso-8859-9
Vietnamese | viscii, windows-1258
Others | windows-1252
Remarks: No System.Text.Encoding is available for some aliases: cp949, iso-2022-cn, euc-tw, iso-8859-10, iso-8859-16, viscii, X-ISO-10646-UCS-4-34121/X-ISO-10646-UCS-4-21431. For some of them, DetectionDetail.Encoding returns a suitable replacement:
cp949: use ks_c_5601-1987
iso-2022-cn: use x-cp50227
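The replacement rule above can be pictured as a small lookup applied before asking .NET for the encoding. The sketch below is an illustration only, not the code behind DetectionDetail.Encoding. The class and method names are hypothetical, and it assumes CodePagesEncodingProvider (System.Text.Encoding.CodePages) is available to resolve legacy code pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AliasFallback
{
    // The two replacements named in the remarks; other names pass through.
    private static readonly Dictionary<string, string> Replacements =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "cp949", "ks_c_5601-1987" },
            { "iso-2022-cn", "x-cp50227" },
        };

    // Maps a detected alias to the name that should be looked up instead.
    public static string ResolveAlias(string name)
    {
        return Replacements.TryGetValue(name, out string replacement) ? replacement : name;
    }

    // Returns the encoding for the (possibly replaced) name, or null when
    // .NET does not know it, mirroring the "can be null" behaviour.
    public static Encoding TryGetEncoding(string name)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        try
        {
            return Encoding.GetEncoding(ResolveAlias(name));
        }
        catch (ArgumentException)
        {
            return null; // unknown or unsupported name
        }
    }
}
```

Aliases without a listed replacement, such as viscii, would still yield null here, so callers need to handle that case.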
The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").
Test data has been extracted from Wikipedia and The Project Gutenberg books and is subject to their licenses.