Big Data Public Data Area

File Vault

Hydraulic Fracturing Tweets
A collection of tweets that includes words related to Hydraulic fracturing from March 28, 2013 to June 7, 2013 can be found here. Searches on the terms "#fracking", "fracking", "hydrofracking", "#hydrofracking", "hydraulic fracturing" make up these tweets. Duplicates were removed from them. These are in .csv format for easy loading into Excel and other programs.
Ice Breaker Exercise for Students
Instructions for doing basic text analysis using the Vidia platform below. Students use this file to test their systems and be sure the understand the Rapid Miner environment on Vidia.
icebreaker assignment for RapidMiner--v3
26 March 14

PURPOSE

Load a process and a data file to your RapidMiner (RM) repository. Run the process to see RM perform some simple manipulations on the data. Both the input and the results
will display to your screen. The results will save to an output csv file.

We'll start with an input file that contains information about degrees offered at Oneonta. The RM process will load these data and select a small subset of them to return.

The RM process file expects specific repository locations and filenames within your own repository space. Please follow directions for naming files. Alternately, you can change
the filenames your process expects.

INSTRUCTIONS

0. Log on to your account on vidia.ccr.buffalo.edu

1. Start Rapid Miner

2. Import and save the RapidMiner analysis process file.

Select File: Import Process
specify file path and name:
/data/oneonta/icebreaker/oneonta-degrees-v3.xml
click Open

RM will load a process in the main window.

In the lower left-hand Repository pane:

Right click the Local Repository "processes" folder in the lower left hand Repository pane
Select Store Process Here and supply the name:
oneonta-degrees-v3

RM will store a copy of the process in your own repository. Verify that oneonta-degrees-v3 is listed under Local Repository processes.

3. Import the input file.

Under Repositories in the lower left hand Repository pane, click the Import icon (inbox with green arrow).
Select Import CSV file. (Alternately: Select File -> Import Data -> Import CSV file ...
from the File menu.)

Step 1: specify file path and name:
/data/oneonta/icebreaker/oneonta-ug-degrees.csv
Click Next

Step 2: Click Column Separation: Comma. The screen should display data with columns,
"Area", "Major", "B.A. Degree", "B.S. Degree", and "HEGIS Code"
Click Next.

Step 3: Click Next

Step 4: Select the data columns to import and indicate their types.

First, uncheck the box at the top of HEGIS Code. This column is will not be
imported into the repository.

Select an Attribute type from the drop down for each column (attribute),
as follows:
Area: nominal attribute
Major: text attribute
B.A. Degree: nominal attribute
B.S. Degree: nominal attribute
"Nominal" type means that only certain values exist for a column.

Click Next.

Step 5: Select where to import the data: Click "data" under "Local Repository".
Under Name, specify
degrees

Click Finish.

RM should display the Results View, with a tab showing the ExampleSet (csv data file) you just imported.

Click the Design View (paper and pencil icon). Verify that "degrees" is listed under Local Repository data.

4. Verify the input and output filenames in your process.

INPUT

In the Design View, with the process open, click on the box (operator) that is labeled "Retrieve degrees data". This is the leftmost operator in the workflow.

In the right hand parameters pane, you should see
repository entry: ../data/degrees

This refers to the input, degrees, that you just imported to your repository.
If you changed the name or location of the input file from those recommended in step 3,
you must match that name or location here in order to run the process successfully.

OUTPUT

In the Design View, with the process open, click on the box (operator) that is
labeled "Write Output csv". This is the rightmost operator in the workflow.

In the right hand parameters pane, you should see
csv file: icebreaker-results.csv

This refers to the output file, icebreaker-results.csv, that will be created by the process.
It will save into your home directory on vidia and can be downloaded to your own computer desktop using the WebDAV utility.

5. Understand the process
There are 5 boxes (operators) in this process workflow that accept inputs, make transformations on them, and then pass a result to the next operator in the workflow.
Inputs and outputs are supplied by connecting "ports" on the operators. Here is the summary of the functions that the operators perform in this process workflow:

1. Retrieve degrees data: Loads the input data from your own repository.
2. Duplicate Input: Makes two identical copies of the input file. One is sent to display in the results view, one is sent to operator #3.
3. Filter Data: Filters from the input only those degree programs that offer a B.A.,but not a B.S.
4. Select Data: Selects "B.A. Degree", "B.S. Degree", and Major from the filtered data.
5. Write Output CSV: Saves a copy of your results to the csv file icebreaker-results.csv in your home directory. It also sends the results to display in the results view.

Notice that one "wire" in the workflow pipes output from the Duplicate operator to the right hand side of the Process view, which allows RapidMiner to display the output in the Results View. Another "wire" does the same from the "Write Output CSV" operator.

6. Run the result
Click the Play (triangle) icon. RapidMiner will run your process on your dataset.
You can see green lights appear in each operator as it completes successfully.

In the results view, RapidMiner displays the results of running the process. You will see three tabs:
* Result Overview
* ExampleSet (Select Data)
* ExampleSet (Duplicate Input)

The tab ExampleSet (Select Data) displays the results of running the process.
This ExampleSet contains degree programs that offer a B.A. degree, but not a B.S. degree. You should see six such degree programs listed, each with a Row No.,
Major, B.A. Degree, and B.S. Degree column. What are the 6 Majors that RM found?

The tab ExampleSet (Duplicate Input) displays the input you sent to the process.
How many columns are there? How many rows?

The tab Result Overview displays information about the time RM took to execute the process.

Explore the tabs in the results view, and click the Meta Data View buttons on the ExampleSets. Notice the Types that you set in step 3.

If you have changed anything about your process since importing it, save it now.

7. Use WebDAV to download the result CSV generated by running the process. You should find the file saved in your home directory and named icebreaker-results.csv.
Word Count Analysis Files
Instructions and sample data files for doing word count analysis on text. Students use these files to setup a word count analysis on Vidia which they can then substitute their own data files for. Right click as Save Link As to download each of these files. These files can also be found on Vidia in the /data/oneonta directory.
This XML file does not appear to have any style information associated with it. The document tree is shown below. Download Rapid Miner Process for Word Count Analysis XML

<process version="5.3.013">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<description>
Generate word list of terms with frequency dictated by Process Documents settings "prune below" and "prune above". Upper limit can be adjusted to include/exclude the leading search term which may occur very frequently. Process Documents options --------------------------------------------- stemming operator can be enabled n-grams are computed. set to 2 or 3. Discarding all text connected to URLs, entity tags, hashtags. Output ----------- Output is .csv word list with frequencies (by document and by total) Output can be used to generate word cloud (wordle.net) Plotting ------------ From results pane, plot Word List output as pie, bar, scatterplot.
</description>
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
<parameter key="repository_entry" value="../../data/fracking-example-trackur"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.3.013" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="75">
<description>
Enable this operator only if using a Trackur dataset.
</description>
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="Media Type=Twitter"/>
<parameter key="invert_filter" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
<description>
Generate a word list of common terms from your dataset, with counts. The range of counts reported in the results are dictated by "prune below" and "prune above" parameters. * "prune above" determines the largest count that will be reported, and can be adjusted to include/exclude the leading search term(s). * "prune below" determines the smallest count that will be reported, and can be adjusted upwards to focus on more important words in your dataset. Once you adjust pruning, enable these operators: "Stem (Porter)" operator: For instance, transform "fracking" and "fracks" to "frack" "Generate n-grams" operator: Count words that frequently occur together. Set to 2 or 3. Output Word List can be used to generate word cloud (wordle.net)
</description>
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="10"/>
<parameter key="prune_above_absolute" value="9000"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30">
<parameter key="mode" value="regular expression"/>
<parameter key="characters" value=".:"/>
<parameter key="expression" value="\s"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="30">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens (3)" width="90" x="447" y="30">
<description>
Replace hashtags, entities, and urls with constant strings
</description>
<list key="replace_dictionary">
<parameter key="^http.+" value="URL"/>
<parameter key="@.+" value="ENTITY"/>
<parameter key="#.+" value="HASHTAG"/>
</list>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (5)" width="90" x="45" y="120">
<description>Delete all URL tokens</description>
<parameter key="condition" value="equals"/>
<parameter key="string" value="URL"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (6)" width="90" x="179" y="120">
<description>Delete all ENTITY tokens</description>
<parameter key="condition" value="equals"/>
<parameter key="string" value="ENTITY"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (4)" width="90" x="313" y="120">
<description>Delete all HASHTAG tokens</description>
<parameter key="condition" value="equals"/>
<parameter key="string" value="HASHTAG"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="45" y="210">
<description>Delete all tokens that match html markup</description>
<parameter key="condition" value="matches"/>
<parameter key="string" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="regular_expression" value="^&.+"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts" width="90" x="179" y="210">
<description>
This operator removes quote marks around a given token.
</description>
<parameter key="deletion_regex" value="^\"|\"$"/>
</operator>
<operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts (2)" width="90" x="313" y="210">
<description>Delete stray punctuation from tokens</description>
<parameter key="deletion_regex" value="[.:;,?!\'\"\$]+"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (7)" width="90" x="447" y="210">
<description>Delete tokens that end in numbers</description>
<parameter key="condition" value="matches"/>
<parameter key="string" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="regular_expression" value="^([a-z])*+[0-9.\/-]$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="581" y="210">
<description>Delete tokens that start with numbers</description>
<parameter key="condition" value="matches"/>
<parameter key="string" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (3)" width="90" x="447" y="345">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="30"/>
</operator>
<operator activated="false" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="581" y="435"/>
<operator activated="false" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="581" y="525">
<parameter key="max_length" value="2"/>
</operator>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
<connect from_op="Replace Tokens (3)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
<connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
<connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Remove Document Parts" to_port="document"/>
<connect from_op="Remove Document Parts" from_port="document" to_op="Remove Document Parts (2)" to_port="document"/>
<connect from_op="Remove Document Parts (2)" from_port="document" to_op="Filter Tokens (7)" to_port="document"/>
<connect from_op="Filter Tokens (7)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="5.3.002" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="165"/>
<operator activated="true" class="write_csv" compatibility="5.3.013" expanded="true" height="76" name="Write CSV" width="90" x="447" y="300">
<parameter key="csv_file" value="/home/jsperhac/resources/oneonta-NSF/assignments/fracking-example/test.csv"/>
<parameter key="column_separator" value=","/>
<parameter key="write_attribute_names" value="true"/>
<parameter key="quote_nominal_values" value="true"/>
<parameter key="format_date_attributes" value="true"/>
<parameter key="append_to_file" value="false"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
K-Means Clustering Analysis Files
Instructions and a Rapid Miner process file for demonstrating clustering. The instructions point users to a sample data file, but any file can be used. Right click and select Save File As to download these files. These files can also be found on Vidia in the /data/oneonta directory.
This XML file does not appear to have any style information associated with it. The document tree is shown below. Download Rapid Miner Process Code for Clustering XML

<process version="5.3.013">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<description>
Note: k-means clustering seems faster than Data to Similarity, and as informative.
</description>
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="5.3.013" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="../../data/fracking-test/fracking-test"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="5.3.013" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
<parameter key="condition_class" value="attribute_value_filter"/>
<parameter key="parameter_string" value="Media Type=Twitter"/>
<parameter key="invert_filter" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="165">
<description>
PROCESS DOCUMENTS PRUNING: Prune results aggressively so only frequently-occurring terms are reported. Exclude the stem "frack" so that most-commonly appearing term in dataset is suppressed. SOME PROCESS DETAILS: Strip all text associated with URLs, entities (@), and hashtags (#). Keep tokens 3 to 25 characters Perform word stemming
</description>
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="149"/>
<parameter key="prune_above_absolute" value="900"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30">
<parameter key="mode" value="regular expression"/>
<parameter key="characters" value=".:"/>
<parameter key="expression" value="\s"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="30">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens (3)" width="90" x="447" y="30">
<list key="replace_dictionary">
<parameter key="^http.+" value="URL"/>
<parameter key="@.+" value="ENTITY"/>
<parameter key="#.+" value="HASHTAG"/>
</list>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="45" y="120">
<parameter key="condition" value="matches"/>
<parameter key="string" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="regular_expression" value="^&.+"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts" width="90" x="179" y="120">
<parameter key="deletion_regex" value="^\"|\"$"/>
</operator>
<operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts (2)" width="90" x="313" y="120">
<parameter key="deletion_regex" value="[.:;,?!\'\$]+"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="120">
<parameter key="condition" value="matches"/>
<parameter key="string" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (5)" width="90" x="45" y="210">
<parameter key="condition" value="equals"/>
<parameter key="string" value="URL"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (6)" width="90" x="179" y="210">
<parameter key="condition" value="equals"/>
<parameter key="string" value="ENTITY"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (4)" width="90" x="313" y="210">
<parameter key="condition" value="equals"/>
<parameter key="string" value="HASHTAG"/>
<parameter key="regular_expression" value="^[0-9.\/-]+([a-z])*$"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (3)" width="90" x="447" y="255">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="false" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="345"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Replace Tokens (3)" to_port="document"/>
<connect from_op="Replace Tokens (3)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Remove Document Parts" to_port="document"/>
<connect from_op="Remove Document Parts" from_port="document" to_op="Remove Document Parts (2)" to_port="document"/>
<connect from_op="Remove Document Parts (2)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Tokens (5)" to_port="document"/>
<connect from_op="Filter Tokens (5)" from_port="document" to_op="Filter Tokens (6)" to_port="document"/>
<connect from_op="Filter Tokens (6)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
<connect from_op="Filter Tokens (4)" from_port="document" to_op="Filter Tokens (3)" to_port="document"/>
<connect from_op="Filter Tokens (3)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="5.3.013" expanded="true" height="76" name="Clustering" width="90" x="447" y="165">
<description>
For clustering of text data, need Cosine Similarity clustering. Try small k.
</description>
<parameter key="add_cluster_attribute" value="false"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="3"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="false"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="correlation_matrix" compatibility="5.3.013" expanded="true" height="94" name="Correlation Matrix" width="90" x="581" y="165">
<parameter key="create_weights" value="false"/>
<parameter key="normalize_weights" value="true"/>
<parameter key="squared_correlation" value="false"/>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="5.3.002" expanded="true" height="76" name="WordList to Data" width="90" x="447" y="300"/>
<operator activated="true" class="write_csv" compatibility="5.3.013" expanded="true" height="76" name="Write CSV" width="90" x="581" y="300">
<parameter key="csv_file" value="resources/oneonta-NSF/assignments/fracking-example/fracking-example-stemming-wordlist.csv"/>
<parameter key="column_separator" value=","/>
<parameter key="write_attribute_names" value="true"/>
<parameter key="quote_nominal_values" value="true"/>
<parameter key="format_date_attributes" value="true"/>
<parameter key="append_to_file" value="false"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 2"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Correlation Matrix" to_port="example set"/>
<connect from_op="Correlation Matrix" from_port="example set" to_port="result 3"/>
<connect from_op="Correlation Matrix" from_port="matrix" to_port="result 4"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Write CSV" to_port="input"/>
<connect from_op="Write CSV" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
</process>
</operator>
</process>

Creative Commons

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Public Data Area

File Vault

Hydraulic Fracturing Tweets

Ice Breaker Exercise for Students

Word Count Analysis Files

K-Means Clustering Analysis Files

Creative Commons

Grant Funding

Collaborators