Module: EncodingDetector

Defined in:
lib/encoding_detector.rb

Overview

Detect encodings of arbitrary strings for later conversion.

This is a naive implementation. It interprets the string as each of a list of known encodings in turn, transcodes the result to UTF-8, and scores each result by its total length and its number of invalid characters. The lowest score wins: the shorter the resulting string, the more likely the interpretation is valid. Each invalid character counts ten times as much as an ordinary character, to avoid false positives and assist with UTF-16 detection (which produces shorter strings).
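The approach described above can be sketched in a few lines of stand-alone Ruby. This is a minimal illustration rather than the module's own code: `REPLACEMENT`, `CANDIDATES`, `sketch_score` and `sketch_detect` are hypothetical names, UTF-16 is left out to keep the example small, and the extra `scrub` call is an assumption added so that invalid bytes are replaced even when source and target encodings are both UTF-8.

```ruby
# Stand-alone sketch; every name here is illustrative, not part of EncodingDetector.
REPLACEMENT = "\uFFFD"
CANDIDATES = %w[UTF-8 ISO-8859-1].freeze # UTF-16 omitted for brevity

# Interpret the raw bytes as `encoding`, transcode to UTF-8, and score the result.
# Lower is better: one point per character plus ten per replacement character.
def sketch_score(bytes, encoding)
  text = bytes.dup.force_encoding(encoding)
              .encode('UTF-8', invalid: :replace, undef: :replace, replace: REPLACEMENT)
              .scrub(REPLACEMENT) # assumption: guards the UTF-8 to UTF-8 case
  text.length + (text.count(REPLACEMENT) * 10)
end

# Return the first candidate with the strictly lowest score.
def sketch_detect(bytes)
  best = { encoding: 'UNKNOWN', score: Float::INFINITY }
  CANDIDATES.each do |encoding|
    score = sketch_score(bytes, encoding)
    best = { encoding: encoding, score: score } if score < best[:score]
  end
  best[:encoding]
end

puts sketch_detect("caf\xE9 au lait".b)     # ISO-8859-1 (12 vs 22 when read as UTF-8)
puts sketch_detect("caf\xC3\xA9 au lait".b) # UTF-8 (12 vs 13 when read as ISO-8859-1)
```

In the first call the lone byte 0xE9 is invalid as UTF-8 and costs ten extra points, so ISO-8859-1 wins. In the second, reading the two-byte UTF-8 sequence as ISO-8859-1 yields one extra character, so UTF-8 wins on length alone.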

Constant Summary

ENCODINGS =

Candidate encodings, in priority order. A later encoding is chosen only if it scores strictly better than every earlier one.

%w[UTF-8 ISO-8859-1 UTF-16].freeze
MAX_SCORE =
Float::INFINITY
INVALID_CHARACTER =
'�'
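The ordering rule for ENCODINGS follows from a strict less-than comparison: a tie never displaces an earlier candidate. A small sketch with pre-computed scores shows the effect. The helper name `pick_best` and the score values are hypothetical, chosen to represent a pure-ASCII input that reads identically in both encodings.

```ruby
# Hypothetical helper mirroring a "strictly better wins" selection loop.
def pick_best(candidates)
  best = { encoding: 'UNKNOWN', inverted_score: Float::INFINITY }
  candidates.each do |candidate|
    # Strict less-than: an equal score keeps the earlier candidate.
    best = candidate if candidate[:inverted_score] < best[:inverted_score]
  end
  best
end

# Illustrative scores for a five-character ASCII string.
tied = [
  { encoding: 'UTF-8', inverted_score: 5 },
  { encoding: 'ISO-8859-1', inverted_score: 5 }
]

puts pick_best(tied)[:encoding] # UTF-8
```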

Class Method Summary

Class Method Details

.convert_to_utf8(content) ⇒ String

Convert the given content to UTF-8, using the detected encoding if possible. If the encoding cannot be detected, the content is transcoded from its current encoding instead. In both cases, invalid and undefined characters are replaced with the Unicode replacement character (�).

Parameters:

  • content (String)

    The string to be converted to UTF-8

Returns:

  • (String)

    The content converted to UTF-8, with invalid characters replaced
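The conversion step itself relies on two core String methods: `force_encoding` relabels the bytes without changing them, and `encode` then transcodes them. The following sketch shows that pair in isolation; the helper name `reinterpret_as_utf8` is hypothetical, and the source encoding is passed in by hand instead of being detected.

```ruby
# Hypothetical helper: read `bytes` as `source_encoding` and return a UTF-8 copy.
def reinterpret_as_utf8(bytes, source_encoding)
  bytes.dup                              # leave the caller's string untouched
       .force_encoding(source_encoding)  # relabel the bytes, no conversion yet
       .encode('UTF-8', invalid: :replace, undef: :replace, replace: "\uFFFD")
end

latin1_bytes = "caf\xE9".b # "café" in ISO-8859-1, tagged as binary
utf8_text = reinterpret_as_utf8(latin1_bytes, 'ISO-8859-1')

puts utf8_text                 # café
puts utf8_text.encoding        # UTF-8
puts utf8_text.valid_encoding? # true
```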



# File 'lib/encoding_detector.rb', line 68

def self.convert_to_utf8(content)
  detection = detect(content)
  if detection[:inverted_score] < MAX_SCORE
    # Reinterpret a copy of the bytes in the detected encoding, then transcode.
    # dup keeps force_encoding from retagging the caller's string.
    content.dup.force_encoding(detection[:encoding])
           .encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)
  else
    # If detection fails, replace unknown characters with a Unicode replacement character
    content.encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)
  end
end

.detect(content) ⇒ Hash

Detect the encoding of the given content from the list of known encodings.

Parameters:

  • content (String)

    The string whose encoding is to be detected

Returns:

  • (Hash)

    A hash with the keys :encoding (the best-scoring encoding name, or 'UNKNOWN') and :inverted_score (lower is better)
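A caller can tell a successful detection from a failed one by comparing :inverted_score with MAX_SCORE, which is the same check convert_to_utf8 performs. The sketch below uses hand-written hashes in the documented shape; `detected?` is a hypothetical helper and the score values are illustrative.

```ruby
MAX_SCORE = Float::INFINITY # same value as the documented constant

# Hypothetical helper: true when some candidate produced a finite score.
def detected?(detection)
  detection[:inverted_score] < MAX_SCORE
end

hit  = { encoding: 'ISO-8859-1', inverted_score: 12 }      # illustrative values
miss = { encoding: 'UNKNOWN', inverted_score: MAX_SCORE }  # the initial best guess

puts detected?(hit)  # true
puts detected?(miss) # false
```

Going by the source listings, the 'UNKNOWN' result survives only when every candidate scores MAX_SCORE, which score_encoding returns for blank content.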



# File 'lib/encoding_detector.rb', line 22

def self.detect(content)
  best_guess = { inverted_score: MAX_SCORE, encoding: 'UNKNOWN' }
  ENCODINGS.each do |encoding|
    result = test_encoding(content, encoding)
    best_guess = result if result[:inverted_score] < best_guess[:inverted_score]
  end
  best_guess
end

.score_encoding(content, encoding) ⇒ Object



# File 'lib/encoding_detector.rb', line 31

def self.score_encoding(content, encoding)
  # blank? comes from ActiveSupport; blank content cannot be scored
  return { inverted_score: MAX_SCORE, encoding: encoding } if content.blank?

  # The shorter the resulting string, the more likely it is to be valid;
  # text decoded with the wrong encoding gains characters
  total_chars = content.length

  # Invalid characters increase inverted_score (lower is better)
  invalid_chars = content.count(INVALID_CHARACTER)

  # Calculate a score based on the above
  inverted_score = total_chars + (invalid_chars * 10) # penalise invalid chars more

  # Return the score as a hash
  {
    inverted_score: inverted_score,
    encoding: encoding,
    _total_chars: total_chars,
    _invalid_chars: invalid_chars
  }
end
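The scoring arithmetic can be checked by hand with a stand-alone version of the formula. This is a sketch under stated assumptions: `score_text` and `REPLACEMENT_CHAR` are hypothetical names, and `strip.empty?` is used as a rough stand-in for ActiveSupport's `blank?`, so the block runs on plain Ruby.

```ruby
REPLACEMENT_CHAR = "\uFFFD"

# Hypothetical re-statement of the formula: length plus ten per invalid character.
def score_text(text)
  # Approximation of blank?: empty or whitespace-only text gets the worst score.
  return Float::INFINITY if text.strip.empty?

  text.length + (text.count(REPLACEMENT_CHAR) * 10)
end

puts score_text("caf\u00E9") # 4  (four characters, none invalid)
puts score_text("caf\uFFFD") # 14 (four characters, one invalid: 4 + 10)
puts score_text('   ')       # Infinity
```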

.test_encoding(content, encoding) ⇒ Object



# File 'lib/encoding_detector.rb', line 52

def self.test_encoding(content, encoding)
  # Interpret a copy of the content as the given encoding, then transcode it to UTF-8
  encoded_content = content.dup.force_encoding(encoding)
                           .encode('UTF-8', invalid: :replace, undef: :replace, replace: INVALID_CHARACTER)

  score = score_encoding(encoded_content, encoding)
  inverted_score = score[:inverted_score]
  # Shorthand hash syntax (Ruby 3.1+); the _total_chars and _invalid_chars keys are dropped
  { encoding:, inverted_score: }
end