Perl Text Processing: Word Frequency Counter

Perl

package Text::WordFrequency;

use strict;
use warnings;

# Common English stop words to ignore
my %stop_words = (
    'a' => 1, 'an' => 1, 'the' => 1, 'is' => 1, 'it' => 1, 'in' => 1, 'on' => 1,
    'of' => 1, 'and' => 1, 'to' => 1, 'for' => 1, 'with' => 1, 'by' => 1,
    'this' => 1, 'that' => 1, 'be' => 1, 'are' => 1, 'was' => 1, 'were' => 1,
    'i' => 1, 'you' => 1, 'he' => 1, 'she' => 1, 'we' => 1, 'they' => 1,
    'my' => 1, 'your' => 1, 'his' => 1, 'her' => 1, 'its' => 1, 'our' => 1, 'their' => 1,
    'me' => 1, 'him' => 1, 'us' => 1, 'them' => 1,
    'at' => 1, 'from' => 1, 'about' => 1, 'as' => 1, 'so' => 1, 'if' => 1,
    'or' => 1, 'but' => 1, 'not' => 1,
    'what' => 1, 'when' => 1, 'where' => 1, 'why' => 1, 'how' => 1,
    'all' => 1, 'any' => 1, 'both' => 1, 'each' => 1, 'few' => 1, 'more' => 1,
    'most' => 1, 'other' => 1, 'some' => 1, 'such' => 1,
    'no' => 1, 'nor' => 1,
    'only' => 1, 'own' => 1, 'same' => 1, 'than' => 1,
    'too' => 1, 'very' => 1,
    'can' => 1, 'will' => 1, 'just' => 1, 'don' => 1, 't' => 1, 's' => 1, 're' => 1, 'll' => 1, 've' => 1, 'm' => 1
);

# Constructor: creates an object with an empty word-count hash
sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $self->{word_counts} = {};
    return $self;
}

# Adds the words of $text to the running counts
sub process_text {
    my ($self, $text) = @_;

    # Convert to lowercase and remove punctuation
    $text = lc($text);
    $text =~ s/[^a-z\s]//g; # Keep only letters a-z and whitespace; "don't" becomes "dont"

    # Split text into words
    my @words = split /\s+/, $text;

    # Count word frequencies, ignoring stop words
    foreach my $word (@words) {
        next if $word eq ''; # Skip the empty first field produced by leading whitespace
        next if exists $stop_words{$word};
        $self->{word_counts}{$word}++;
    }
}

# Returns a reference to the internal count hash (not a copy)
sub get_word_counts {
    my ($self) = @_;
    return $self->{word_counts};
}

# Prints every counted word with its frequency
sub display_counts {
    my ($self) = @_;
    my $counts = $self->get_word_counts();

    # Sort words alphabetically for display
    my @sorted_words = sort keys %$counts;

    print "Word Frequencies:\n";
    foreach my $word (@sorted_words) {
        print " - $word: ", $counts->{$word}, "\n";
    }
}

1; # A module must return a true value when loaded with use or require

# Example Usage:
# my $counter = Text::WordFrequency->new();
# my $sample_text = "This is a sample text. This text is for testing the word frequency counter.";
# $counter->process_text($sample_text);
# $counter->display_counts();

Algorithm description:

This Perl module, `Text::WordFrequency`, counts how often each word occurs in a given text. It lowercases the text, deletes every character that is not a letter a-z or whitespace, and splits the result into individual words. Words found in a predefined hash of common English stop words are skipped. The counts are kept in a hash stored in the object as `$self->{word_counts}`, so repeated calls to `process_text` accumulate into the same totals. Finally, the module provides methods to retrieve the counts and to print them in alphabetical order. Word counting of this kind is a basic building block in natural language processing, search engines, and document analysis.
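The steps described above can be tried end to end without installing the module. The following is a minimal sketch, not the module itself: `count_words` is a hypothetical helper name, and `%stop` is a deliberately reduced stop list chosen for this sample sentence only.

```perl
use strict;
use warnings;

# Hypothetical reduced stop list; the module's own list is much longer.
my %stop = map { $_ => 1 } qw(a is this for the);

# Hypothetical helper that mirrors the lowercase / strip / split / count steps.
sub count_words {
    my ($text) = @_;
    my %counts;
    $text = lc $text;                  # normalize case
    $text =~ s/[^a-z\s]//g;            # delete everything except a-z and whitespace
    for my $word (split /\s+/, $text) {
        next if $word eq '';           # leading whitespace yields one empty field
        next if $stop{$word};          # skip stop words
        $counts{$word}++;
    }
    return \%counts;
}

my $counts = count_words(
    "This is a sample text. This text is for testing the word frequency counter."
);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

With this sample sentence, `text` is counted twice and `counter`, `frequency`, `sample`, `testing`, and `word` once each; the stop words `this`, `is`, `a`, `for`, and `the` do not appear in the result.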

Algorithm explanation:

The `Text::WordFrequency` module processes text to count word occurrences. The `new` method initializes an empty hash reference, `$self->{word_counts}`, to store the frequencies.

The `process_text` method takes the input string, converts it to lowercase using `lc()`, and then applies the substitution `s/[^a-z\s]//g` to delete every character that is not a lowercase ASCII letter or whitespace. This strips punctuation, and it also strips digits and non-ASCII letters. Because the characters are deleted rather than replaced with a space, a contraction such as "don't" becomes "dont", so the fragment entries in the stop list (`don`, `t`, `s`, `re`, `ll`, `ve`, `m`) only match when they occur as separate words in the input.

The cleaned text is then split into an array of words using `split /\s+/`. The code iterates through these words; if a word is not empty (leading whitespace produces one empty first field) and not present in the `%stop_words` hash, its count is incremented in `$self->{word_counts}`. The `get_word_counts` method returns a reference to this hash, not a copy. The `display_counts` method sorts the words alphabetically and prints them with their frequencies.

Cleaning and splitting take roughly O(L) time, where L is the text length. Iterating through the words and updating the hash is O(W), where W is the number of words, with hash operations being amortized O(1). Since W cannot exceed L, `process_text` runs in O(L) overall. `display_counts` additionally sorts the U unique words, which costs O(U log U) comparisons. Space complexity is O(U) for the stored counts, plus O(L) of temporary space for the lowercased copy and the word array during each call.
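To see what `s/[^a-z\s]//g` does to contractions and digits, the sketch below runs it on a short made-up sentence and compares it with a variant that replaces the disallowed characters with a space. The space-replacing variant is an assumption added here for contrast; it is not what the module does.

```perl
use strict;
use warnings;

my $input = "Don't stop: it's 2 o'clock!";

# The module's approach: delete disallowed characters outright.
(my $deleted = lc $input) =~ s/[^a-z\s]//g;

# Assumed alternative (not in the original): replace them with a space.
(my $spaced = lc $input) =~ s/[^a-z\s]+/ /g;

# Split on whitespace and drop any empty fields.
my @deleted_words = grep { length } split /\s+/, $deleted;
my @spaced_words  = grep { length } split /\s+/, $spaced;

print "deleted: @deleted_words\n";   # dont stop its oclock
print "spaced:  @spaced_words\n";    # don t stop it s o clock
```

Deleting keeps each contraction as a single token ("dont", "its"), while replacing with a space produces fragments such as "don" and "t", which is the form the fragment entries in the stop list would match.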

Pseudocode:

Module Text::WordFrequency:
  Global stop_words (hash).

  Method new():
    Initialize word_counts (empty hash).

  Method process_text(text):
    Convert text to lowercase.
    Remove all characters except lowercase letters a-z and whitespace from text.
    Split text into words array using whitespace as delimiter.

    For each word in words array:
      If word is empty, continue.
      If word is in stop_words, continue.
      Increment word_counts[word].

  Method get_word_counts():
    Return word_counts.

  Method display_counts():
    Get word_counts.
    Sort keys of word_counts alphabetically.
    Print "Word Frequencies:".
    For each sorted_word:
      Print "  - " + sorted_word + ": " + word_counts[sorted_word].
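
The display step sorts keys alphabetically. If a ranking by frequency is more useful, one possible variation is sketched below. The `$counts` data is made up for illustration and only has the same shape as the hash reference that `get_word_counts()` returns; the ordering rule (count descending, then alphabetical) is an assumption, not part of the original module.

```perl
use strict;
use warnings;

# Hypothetical counts, shaped like the hash reference get_word_counts() returns.
my $counts = { text => 2, sample => 1, word => 1, counter => 1 };

# Highest count first; ties fall back to alphabetical order.
my @by_frequency = sort {
    $counts->{$b} <=> $counts->{$a} or $a cmp $b
} keys %$counts;

print "Word Frequencies:\n";
print " - $_: $counts->{$_}\n" for @by_frequency;
```

For the data above this lists `text` first, followed by `counter`, `sample`, and `word`.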