Module: EncodingDetector

Defined in:
lib/encoding_detector.rb

Overview

Detect encodings of arbitrary strings for later conversion.

This is a naive implementation: it tries to transcode the string to UTF-8 from each encoding in a list of known encodings, then scores each result by the number of invalid characters and the total length of the resulting string. The shorter the resulting string, the more likely it is to be valid. Invalid characters are penalised more heavily than length, both to avoid false positives and to assist with UTF-16 detection (which produces shorter strings).
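
The scoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the helper names `sketch_transcode` and `sketch_score`, the constant names, the penalty weight of 10, and the assumption that a lower score is better are all hypothetical.

```ruby
# Illustrative sketch only; names and the penalty weight are assumptions,
# not taken from lib/encoding_detector.rb.
REPLACEMENT_CHAR = "\uFFFD"
INVALID_PENALTY = 10 # assumed weight; the real value is not documented here

# Interpret the raw bytes as `encoding_name` and produce a UTF-8 string,
# substituting U+FFFD for anything that cannot be decoded.
def sketch_transcode(bytes, encoding_name)
  candidate = bytes.dup.force_encoding(encoding_name)
  # When the candidate is already UTF-8 there is nothing to transcode,
  # so scrub the invalid byte sequences directly.
  return candidate.scrub(REPLACEMENT_CHAR) if candidate.encoding == Encoding::UTF_8

  candidate.encode('UTF-8', invalid: :replace, undef: :replace, replace: REPLACEMENT_CHAR)
end

# Lower is better (assumption): length of the result plus a heavy
# penalty for every replacement character it contains.
def sketch_score(bytes, encoding_name)
  utf8 = sketch_transcode(bytes, encoding_name)
  utf8.length + (utf8.count(REPLACEMENT_CHAR) * INVALID_PENALTY)
end
```

Under these assumptions, the ISO-8859-1 bytes of "café" score 4 when read as ISO-8859-1 and 14 when read as UTF-8, because the lone 0xE9 byte is invalid UTF-8. One limitation of this sketch is that a genuine U+FFFD in the input is counted as invalid.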

Constant Summary

ENCODINGS =

A later encoding will ONLY be chosen IF it scores better than an earlier one

%w[UTF-8 ISO-8859-1 UTF-16].freeze
MAX_SCORE =
Float::INFINITY
INVALID_CHARACTER =
''
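
The ordering rule noted for ENCODINGS (a later encoding is chosen only if it scores better than an earlier one) amounts to a strict comparison while walking the list in order. A minimal sketch, assuming lower scores are better and using a hypothetical `pick_encoding` helper that takes precomputed scores:

```ruby
# Illustrative sketch only; `pick_encoding` and CANDIDATE_ENCODINGS are
# hypothetical names, and "lower score wins" is an assumption.
CANDIDATE_ENCODINGS = %w[UTF-8 ISO-8859-1 UTF-16].freeze

# scores: Hash of encoding name => numeric score.
# Returns the first encoding with the lowest score, or nil if none
# scores below infinity.
def pick_encoding(scores)
  best = nil
  best_score = Float::INFINITY
  CANDIDATE_ENCODINGS.each do |name|
    score = scores.fetch(name, Float::INFINITY)
    # Strict comparison: a tie never displaces an earlier encoding.
    next unless score < best_score

    best = name
    best_score = score
  end
  best
end
```

With this shape, a tie between UTF-8 and ISO-8859-1 resolves to UTF-8 because it appears first in the list, and starting from an infinite score means any finite result beats the initial state.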

Class Method Summary

Class Method Details

.convert_to_utf8(content) ⇒ String

Convert the given content to UTF-8, using the detected encoding if possible. If the encoding cannot be detected, invalid characters are replaced with the Unicode replacement character (�).

Parameters:

  • content (String)

    The string to be converted to UTF-8

Returns:

  • (String)

    The content converted to UTF-8, with invalid characters replaced
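
The conversion behaviour described above might look something like the following sketch. The helper `sketch_convert_to_utf8` and its optional `detected_encoding` argument are hypothetical; the real method takes only `content` and performs detection internally.

```ruby
# Illustrative sketch only; not the actual body of convert_to_utf8.
# `detected_encoding` is a hypothetical parameter standing in for the
# result of detection; nil means no encoding could be detected.
def sketch_convert_to_utf8(content, detected_encoding = nil)
  source = detected_encoding || 'UTF-8'
  candidate = content.dup.force_encoding(source)
  # Already UTF-8 (or nothing detected): replace invalid bytes with U+FFFD.
  return candidate.scrub("\uFFFD") if candidate.encoding == Encoding::UTF_8

  # Otherwise transcode, replacing anything that cannot be converted.
  candidate.encode('UTF-8', invalid: :replace, undef: :replace)
end
```

In either branch the result is a valid UTF-8 string, so callers do not need to check `valid_encoding?` afterwards.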



# File 'lib/encoding_detector.rb', line 68

 do
  # NOTE: The following attribute is not required for Microarray Genotyping.
  # I think this might be broken and suggests that there should be separate classes for project: one for
  # next-gen sequencing that includes this attribute in its metadata, and one for microarray genotyping
  # that doesn't.
  include ProjectManager::Associations
  include BudgetDivision::Associations

  custom_attribute(:project_cost_code, required: true)
  custom_attribute(:funding_comments)
  custom_attribute(:collaborators)
  custom_attribute(:external_funding_source)
  custom_attribute(:sequencing_budget_cost_centre)
  custom_attribute(:project_funding_model, in: PROJECT_FUNDING_MODELS)
  custom_attribute(:gt_committee_tracking_id)

  before_validation do |record|
    record.project_cost_code = nil if record.project_cost_code.blank?
    record.project_funding_model = nil if record.project_funding_model.blank?
  end
end

.detect(content) ⇒ Hash

Detect the encoding of the given content from the list of known encodings.

Parameters:

  • content (String)

    The string whose encoding is to be detected

Returns:

  • (Hash)

    A hash containing the detected encoding and its inverted_score



# File 'lib/encoding_detector.rb', line 22


.score_encoding(content, encoding) ⇒ Object



# File 'lib/encoding_detector.rb', line 31


.test_encoding(content, encoding) ⇒ Object



# File 'lib/encoding_detector.rb', line 52
