Disclaimer: The statistics provided are generated through a combination of direct calculation and heuristic models. They are intended to provide a useful and insightful overview of the text, but their accuracy may vary depending on the specific nature of the input.
Welcome to our Text Analyzer's documentation! Here, we'll delve into the methodologies and algorithms that power our tool, providing transparency on how various text statistics are derived. Whether you're curious about word counts, reading times, or keyword extraction, this section offers a comprehensive look "under the hood."
The foundational metrics of our analyzer are calculated with straightforward, yet precise, methods. The Word Count is derived by splitting the input text using the regular expression /\s+/, which accurately tokenizes words by accounting for spaces, tabs, and newlines. The Character Count is a direct measure of the string's length (text.length).
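For readers who prefer code, these two counts can be sketched roughly as follows. The function names are illustrative only, and the trim step is an assumption added to avoid counting empty tokens; this is not our exact implementation.

```js
// Rough sketch of the Word Count and Character Count logic described above.
function wordCount(text) {
  const trimmed = text.trim(); // trimming is an assumption, to avoid an empty leading token
  // Any run of spaces, tabs, or newlines counts as a single separator.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function characterCount(text) {
  return text.length; // direct string length, including whitespace and punctuation
}
```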
Paragraphs are estimated by splitting the text by double newlines (/\n\n+/). While this is an estimation, it provides a reliable metric for well-structured texts.
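A minimal sketch of the paragraph estimate, with the same caveat that the function name is illustrative:

```js
// Estimate paragraphs by splitting on blank lines (two or more consecutive newlines).
function paragraphCount(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\n\n+/).length;
}
```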
To analyze vocabulary richness, we calculate the number of Unique Words. This involves a multi-step process: the text is lowercased, and then punctuation is stripped from the start and end of each word. These normalized words are then added to a Set data structure to efficiently store and count the unique entries.
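The unique-word logic could look roughly like the sketch below. The exact punctuation-stripping pattern is an assumption; our implementation may differ in which characters it removes.

```js
// Count unique words: lowercase, strip punctuation from token edges, collect in a Set.
function uniqueWordCount(text) {
  const unique = new Set();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    // Strip non-letter, non-digit characters from the start and end of the token.
    const word = token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
    if (word !== "") unique.add(word);
  }
  return unique.size;
}
```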
The Top Keywords are identified by first normalizing the text in the same manner. We then filter out a predefined list of common "stop words" (e.g., "the", "is", "a"). The frequency of the remaining words is tallied in a hash map, and the top 5 are selected to represent the text's primary themes.
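In sketch form, keyword extraction amounts to a frequency tally over non-stop-words. The stop-word list below is a small illustrative sample, not our full list, and the helper name is hypothetical.

```js
// Illustrative stop-word sample; the real list is considerably longer.
const STOP_WORDS = new Set(["the", "is", "a", "an", "and", "of", "to", "in"]);

// Return the most frequent non-stop-words, highest count first.
function topKeywords(text, limit = 5) {
  const counts = new Map();
  for (const token of text.toLowerCase().split(/\s+/)) {
    const word = token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
    if (word === "" || STOP_WORDS.has(word)) continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}
```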
Our time-based estimations leverage established models of human reading and speaking speeds. Reading Time is calculated using the industry-standard model of 200 words per minute (WPM). For Speaking Time, we use a more conservative 130 WPM, which better reflects the pace of public speaking.
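Both estimates reduce to a simple division by the chosen rate. Rounding up to whole minutes is an assumption here; the tool may display finer granularity.

```js
// Time estimates derived from the word count and the rates quoted above.
function readingTimeMinutes(words) {
  return Math.ceil(words / 200); // 200 WPM reading rate
}

function speakingTimeMinutes(words) {
  return Math.ceil(words / 130); // 130 WPM speaking rate
}
```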
The Sentence Count is another estimation, this time based on a regular expression that counts occurrences of terminal punctuation (., !, ?). This method is fast and effective, though it may not be perfectly accurate in the face of complex sentence structures or abbreviations.
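A sketch of this approach is shown below; the exact pattern is an assumption, and abbreviations such as "e.g." illustrate why the count can drift.

```js
// Count sentences by tallying runs of terminal punctuation.
// A run like "?!" counts once; abbreviations like "e.g." inflate the count.
function sentenceCount(text) {
  const matches = text.match(/[.!?]+/g);
  return matches ? matches.length : 0;
}
```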