Module: EncodingDetector
- Defined in: lib/encoding_detector.rb
Overview
Detect encodings of arbitrary strings for later conversion.
This is a naive implementation. It interprets the string's bytes as each encoding in a list of known encodings, transcodes the result to UTF-8, and scores each attempt by the total length of the resulting string and the number of invalid characters it contains. Lower scores are better: the shorter the resulting string, the more likely the encoding is correct. Invalid characters are penalised more heavily than length to avoid false positives and to assist with UTF-16 detection (which produces shorter strings).
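To make the scoring idea concrete, here is a minimal sketch under stated assumptions. `candidate_score` and `REPLACEMENT` are hypothetical names used only for this illustration and are not part of the module; the weight of 10 per invalid character is taken from the `score_encoding` source listed on this page.

```ruby
# Illustrative sketch only: candidate_score and REPLACEMENT are hypothetical
# names, not part of EncodingDetector.
REPLACEMENT = "\uFFFD"

# Treat the bytes as `encoding`, transcode to UTF-8, and return
# length + 10 * (number of replacement characters). Lower is better.
def candidate_score(bytes, encoding)
  decoded = bytes.dup.force_encoding(encoding)
                 .encode('UTF-8', invalid: :replace, undef: :replace, replace: REPLACEMENT)
  decoded.length + (decoded.count(REPLACEMENT) * 10)
end

# "café" stored as ISO-8859-1: é is the single byte 0xE9, which is invalid UTF-8.
latin1_bytes = "caf\xE9".b
puts candidate_score(latin1_bytes, 'UTF-8')      # 4 chars + 1 invalid * 10 = 14
puts candidate_score(latin1_bytes, 'ISO-8859-1') # 4 chars, none invalid = 4

# "café" stored as UTF-8: é is the two bytes 0xC3 0xA9.
utf8_bytes = "caf\xC3\xA9".b
puts candidate_score(utf8_bytes, 'UTF-8')      # 4
puts candidate_score(utf8_bytes, 'ISO-8859-1') # 5, because é decodes as two characters
```

In both cases the correct encoding yields the lowest score: a wrong single-byte guess inflates the length, and a wrong UTF-8 guess introduces heavily penalised replacement characters.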
Constant Summary
- ENCODINGS =
Candidate encodings in priority order. A later encoding is chosen only if it scores strictly better than an earlier one.
%w[UTF-8 ISO-8859-1 UTF-16].freeze
- MAX_SCORE =
Float::INFINITY
- INVALID_CHARACTER =
'�'
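The ordering rule for ENCODINGS follows from the strict `<` comparison used in `.detect`. The sketch below isolates that selection loop; `pick_best` is a hypothetical helper and the scores passed to it are made up for illustration.

```ruby
# Illustrative sketch only: pick_best is a hypothetical helper and the
# scores below are invented, not produced by EncodingDetector.
def pick_best(scored_candidates)
  best = { encoding: 'UNKNOWN', inverted_score: Float::INFINITY }
  scored_candidates.each do |encoding, inverted_score|
    # Strict `<`: a tie never displaces an earlier candidate.
    next unless inverted_score < best[:inverted_score]

    best = { encoding: encoding, inverted_score: inverted_score }
  end
  best
end

# Pure ASCII text decodes identically as UTF-8 and ISO-8859-1, so the two tie.
tie = pick_best([['UTF-8', 5], ['ISO-8859-1', 5], ['UTF-16', 33]])
puts tie[:encoding] # UTF-8, because it is listed first

# A later candidate wins only with a strictly lower score.
later = pick_best([['UTF-8', 14], ['ISO-8859-1', 4]])
puts later[:encoding] # ISO-8859-1
```

Listing UTF-8 first therefore makes it the default whenever the content is equally plausible under several encodings.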
Class Method Summary
- .convert_to_utf8(content) ⇒ String
Convert the given content to UTF-8, using the detected encoding if possible.
- .detect(content) ⇒ Hash
Detect the encoding of the given content from the list of known encodings.
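- .score_encoding(content, encoding) ⇒ Hash
Score already-transcoded content; a lower inverted score is better.
- .test_encoding(content, encoding) ⇒ Hash
Transcode the content from the given encoding to UTF-8 and score the result.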
Class Method Details
.convert_to_utf8(content) ⇒ String
Convert the given content to UTF-8, using the detected encoding if possible. If the encoding cannot be detected, invalid characters are replaced with the Unicode replacement character (�).
# File 'lib/encoding_detector.rb', line 68
def self.convert_to_utf8(content)
  detection = detect(content)
  if detection[:inverted_score] < MAX_SCORE
    # dup so the caller's string is not relabelled in place (and frozen strings are accepted)
    content.dup.force_encoding(detection[:encoding])
           .encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)
  else
    # If detection fails, replace unknown characters with a Unicode replacement character
    content.encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)
  end
end
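The conversion relies on two distinct String operations: `force_encoding` only changes the encoding label, while `encode` rewrites the bytes. The following sketch shows the difference; `relabel_then_transcode` is a hypothetical helper and the sample bytes are an assumption chosen for the demonstration.

```ruby
# Illustrative sketch only: relabel_then_transcode is a hypothetical helper,
# not part of EncodingDetector.
def relabel_then_transcode(bytes, source_encoding)
  # force_encoding changes only the label; working on a copy leaves the input alone.
  relabelled = bytes.dup.force_encoding(source_encoding)
  # encode rewrites the bytes for the target encoding.
  relabelled.encode('UTF-8')
end

raw = "caf\xE9".b.freeze # 4 bytes tagged as binary; frozen to show the input is not mutated
utf8 = relabel_then_transcode(raw, 'ISO-8859-1')

puts raw.bytesize                     # 4
puts raw.encoding == Encoding::BINARY # true: the original label is unchanged
puts utf8.bytesize                    # 5, because é takes two bytes in UTF-8
puts utf8.valid_encoding?             # true
```

Relabelling a copy rather than the original keeps the caller's string untouched and avoids a FrozenError when the input is frozen.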
.detect(content) ⇒ Hash
Detect the encoding of the given content from the list of known encodings.
# File 'lib/encoding_detector.rb', line 22
def self.detect(content)
  best_guess = { inverted_score: MAX_SCORE, encoding: 'UNKNOWN' }
  ENCODINGS.each do |encoding|
    result = test_encoding(content, encoding)
    best_guess = result if result[:inverted_score] < best_guess[:inverted_score]
  end
  best_guess
end
.score_encoding(content, encoding) ⇒ Hash
Score content that has already been transcoded to UTF-8. A lower inverted score is better; blank content receives MAX_SCORE.
# File 'lib/encoding_detector.rb', line 31
def self.score_encoding(content, encoding)
  # blank? is provided by ActiveSupport, not core Ruby
  return { inverted_score: MAX_SCORE, encoding: encoding } if content.blank?

  # The shorter the resulting string, the more likely it is to be valid; unescaped text adds characters
  total_chars = content.length

  # Invalid characters increase inverted_score (a higher score is worse)
  invalid_chars = content.count(INVALID_CHARACTER)

  # Calculate a score based on the above
  inverted_score = (total_chars + (invalid_chars * 10)) # penalise invalid chars more

  # Return the score as a hash
  {
    inverted_score: inverted_score,
    encoding: encoding,
    _total_chars: total_chars,
    _invalid_chars: invalid_chars
  }
end
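As a worked example of the arithmetic, the sketch below repeats the formula on two ten-character strings. `inverted_score_for` is a hypothetical stand-in that omits the `blank?` check so it runs without ActiveSupport.

```ruby
# Illustrative sketch only: inverted_score_for is a hypothetical stand-in
# that repeats the arithmetic from score_encoding without the blank? check.
def inverted_score_for(decoded)
  total_chars = decoded.length
  invalid_chars = decoded.count("\uFFFD")
  total_chars + (invalid_chars * 10)
end

clean  = 'naive cafe'           # 10 characters, none invalid
broken = "na\uFFFDve caf\uFFFD" # 10 characters, 2 of them U+FFFD

puts inverted_score_for(clean)  # 10
puts inverted_score_for(broken) # 10 + 2 * 10 = 30
```

Each replacement character therefore costs 11 points in total: 1 for its contribution to the length and 10 as the penalty.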
.test_encoding(content, encoding) ⇒ Hash
Transcode the content from the given encoding to UTF-8 and score the result.
# File 'lib/encoding_detector.rb', line 52
def self.test_encoding(content, encoding)
  # Interpret a copy of the bytes as the specified encoding, then transcode to UTF-8
  encoded_content = content.dup.force_encoding(encoding)
                           .encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)
  score = score_encoding(encoded_content, encoding)
  inverted_score = score[:inverted_score]

  { encoding:, inverted_score: }
end