shell - Print out the distribution of words in multiple files
I'm trying to create an executable that takes in a number of text files and gives as output the distribution of words by number of occurrences. This is to be done in Bash scripting, and what I have so far is:
    #!/bin/bash
    y=$(cat $* | wc -w)
    cat $* | tr ' ' '//' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]'| sort | uniq -c | sort -rn | head -$y

I get an error when trying to set y, and I can't figure out how to get head to print out every word otherwise.
Is there a better way to print this out?
Why run head at all? There's no guarantee that there will be as many distinct words as there are words in the files; indeed, it's practically guaranteed that there won't be (since there'll be repeated words). And if you want to see all the data, show all the data; don't filter the output of sort -nr.
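To make the point concrete, here is a small sketch that compares the total word count with the number of distinct words. The sample text is made up purely for illustration.

```shell
#!/bin/bash
# Made-up sample text; any input with repeated words behaves the same way.
sample='the cat saw the other cat'

# Total number of words, as the question computes it with wc -w.
total=$(printf '%s\n' "$sample" | wc -w)

# Number of distinct words: one word per line, then unique lines only.
distinct=$(printf '%s\n' "$sample" | tr -cs '[:alpha:]' '\n' | sort -u | wc -l)

# $((...)) strips any padding that wc may add around the number.
echo "total words: $((total)), distinct words: $((distinct))"   # total 6, distinct 4
```

Since the distinct count can never exceed the total, a head limited to the total number of words would not cut anything off.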
The first tr needs just one slash, I think. Normally, you'd map blanks and punctuation to newlines (with the -s option to tr to squeeze adjacent newlines into one). The slashes from the first tr count as punctuation in the third tr, but it isn't obvious what you're trying to do there. I think I'd expect to see something like:
    cat "$@" |
    tr -cs '[:alpha:]' '\n' |       # convert runs of non-alpha characters to a newline
    tr '[:upper:]' '[:lower:]' |    # case-convert to lower case
    sort |
    uniq -c |
    sort -nr

Note the use of "$@" rather than $*. There's no difference when the file names you specify don't contain blanks (spaces, newlines, tabs, etc.); when they do, the "$@" form is right and $* is not, so you may as well always use "$@". It is right far more often than $* is.
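The difference between the two forms can be seen with a minimal sketch. The function names count_args, with_star and with_at are hypothetical helpers invented for this demonstration, and the file names are made up.

```shell
#!/bin/bash
# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

# Hypothetical wrappers that forward their arguments in the two styles.
with_star() { count_args $*; }      # unquoted $*: arguments are split again on whitespace
with_at()   { count_args "$@"; }    # quoted "$@": each argument is passed through intact

with_star "my notes.txt" "todo.txt"   # prints 3: "my notes.txt" became two words
with_at   "my notes.txt" "todo.txt"   # prints 2
```

With cat in place of count_args, the unquoted form would try to open files named my and notes.txt, neither of which was the file intended.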
For some C source code I had lying around, the output of the word-counting script was:
    246 n
    217
    153 int
    141 list
    124 if
    118 t
    103 char
     99
     97 size
     90 buffer
     89 context
     82 d
     81 void
     79 include
     79 h
     78 s
     65
     62 j
     55 ptr
     54 r
     54 const
     53 static
     53 sem
     51 pthread
     49 z
     49 oldneedle
     49 err
     47
     47 return
     46 mutex
     44 printf
     43 error
     43 c

Note that the word 'h' appears as often as the word 'include'; there's a reason for that! The word t appears a lot, but that's because, for example, size_t is treated as two words by the filtering. Preserving underscores is possible; change the first tr to use '[:alpha:]_' (note the underscore). I eliminated digits, but you can keep them if you want.
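As a sketch of that variation, the version below keeps both underscores and digits by using '[:alnum:]_' in the first tr. The wrapper name word_freq and the sample input are assumptions made for this example, not part of the script above.

```shell
#!/bin/bash
# word_freq is a hypothetical wrapper name. It reads the named files,
# or standard input when no file names are given.
word_freq() {
    cat "$@" |
    tr -cs '[:alnum:]_' '\n' |      # keep letters, digits and underscores; all else becomes a newline
    tr '[:upper:]' '[:lower:]' |    # case-convert to lower case
    sort |
    uniq -c |
    sort -nr
}

# Made-up sample input: size_t now survives as a single word, and m2 keeps its digit.
printf 'size_t n; size_t m2;\n' | word_freq
```

The first line of output reports a count of 2 for size_t; n and m2 follow with a count of 1 each.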