Training of ML Model
Manual classification of records is used to train the ML model so that accuracy improves with every scan. The basic idea is to go through every record and verify that the machine has classified it correctly; if it has not, we correct the classification manually. This process is repeated every time a scan is run for any share of any client.
The steps to complete this process are described below:
Method 1: Manual Method
Step 1:
After the scan, you will be provided with a CSV file.
The file will look something like this:
Step 2: (Optional Step)
Open a blank Excel workbook and import the CSV into it. You can do this by selecting ‘Get data from text/csv’ in the ‘Data’ menu.
In the dialog box, set File Origin to UTF-8 so that all characters display correctly. Note: We have clients all over the world, so we receive documents in many languages.
After loading, the Excel sheet will look as below:
Save the new file as an Excel workbook.
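For analysts who prefer a script to the Excel import, the same UTF-8 handling can be sketched in Python. This is a minimal sketch, not part of the official procedure: the function name parse_scan_csv and the file name in the usage note are assumptions made for illustration.

```python
import csv
import io


def parse_scan_csv(raw_bytes):
    """Decode the exported CSV as UTF-8 and return (header, rows)."""
    # "utf-8-sig" decodes plain UTF-8 and also drops a leading
    # byte-order mark if the export happens to include one.
    text = raw_bytes.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    rows = list(reader)  # each row is a dict keyed by column name
    return reader.fieldnames, rows
```

A typical call would read the file as bytes first, for example parse_scan_csv(pathlib.Path("scan.csv").read_bytes()), where scan.csv is a placeholder name.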
Step 3: (Optional Step)
Remove these columns from the Excel sheet:
File Length
Last Moved/Created Date
Last Modified Date
Category Confidence
Subcategory Confidence
Pii Confidence
Classification
Success Code
MD5 (matching values are duplicates)
After the columns are removed, the Excel sheet will look like this:
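The column removal in Step 3 can also be sketched in Python, assuming the rows have been loaded as dictionaries keyed by column name. The exact header spellings are assumptions taken from the list above (in particular, the MD5 header is assumed to be just "MD5"); adjust them to match the actual file.

```python
# Column names as listed in Step 3; exact spellings in the real
# CSV header are an assumption and may need adjusting.
COLUMNS_TO_REMOVE = {
    "File Length",
    "Last Moved/Created Date",
    "Last Modified Date",
    "Category Confidence",
    "Subcategory Confidence",
    "Pii Confidence",
    "Classification",
    "Success Code",
    "MD5",
}


def drop_columns(rows, unwanted=COLUMNS_TO_REMOVE):
    """Return new row dicts without the unwanted columns."""
    return [
        {name: value for name, value in row.items() if name not in unwanted}
        for row in rows
    ]
```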
Step 4:
Enable the filter by selecting ‘Filter’ in the ‘Data’ menu.
Using the File Type column, filter out all file types except these:
csv
doc/docx/docm (or any other Word document variation)
pdf
ppt/pptx (or any other PowerPoint variation)
xls/xlsb/xlsm/xlsx (or any other Excel variation)
After the extra file types are removed, only records with the file types listed above remain.
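The file type filter in Step 4 could be scripted along these lines. This is a sketch under assumptions: the column is assumed to be named FileType, and the allowed set contains only the extensions named above, so other Office variations would need to be added by hand.

```python
# Extensions named in Step 4; extend this set for other Office variations.
ALLOWED_TYPES = {
    "csv",
    "doc", "docx", "docm",
    "pdf",
    "ppt", "pptx",
    "xls", "xlsb", "xlsm", "xlsx",
}


def keep_reviewable(rows, type_column="FileType"):
    """Keep only rows whose file type is in ALLOWED_TYPES."""
    kept = []
    for row in rows:
        # Normalise values such as "PDF" or ".xlsx" before comparing.
        file_type = (row.get(type_column) or "").strip().lstrip(".").lower()
        if file_type in ALLOWED_TYPES:
            kept.append(row)
    return kept
```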
Step 5:
Add the following columns to the Excel sheet:
Department/MainShare
File Checked Y/N
File Value changed Y/N
Validate Date
Previewed (Y/N)
Once the columns are added, arrange the columns in the following order:
Department/MainShare
Path
Category Name
SubCategory
File Checked Y/N
File Value changed Y/N
Validate Date
Previewed (Y/N)
Contains PII
FileType
The Excel sheet will now look like this:
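Adding the analyst columns and arranging them as in Step 5 can be sketched as a single pass over the rows. The function name to_review_layout is hypothetical, and the header spellings follow the list above. Note that this sketch drops any column not in the list.

```python
# Target column order from Step 5.
COLUMN_ORDER = [
    "Department/MainShare",
    "Path",
    "Category Name",
    "SubCategory",
    "File Checked Y/N",
    "File Value changed Y/N",
    "Validate Date",
    "Previewed (Y/N)",
    "Contains PII",
    "FileType",
]


def to_review_layout(rows):
    """Return rows with exactly the Step 5 columns, in order.

    Columns the analyst fills in later start out as empty strings.
    """
    return [{name: row.get(name, "") for name in COLUMN_ORDER} for row in rows]
```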
Step 6:
Below is a detailed explanation of the columns:
Column Name | Pre-Populated | Description |
Department/MainShare | No – Filled by QA Analyst | The department or share for which the scan is done. E.g., for the path /mnt/Extract$/Central Operations/, the QA analyst fills in ClientName-Extract, where Extract is the share/department. |
Path | Yes | Complete path where the document is kept. |
Category Name | Yes | The category name assigned to the record. E.g., HR Data. |
SubCategory | Yes | The subcategory assigned to the record. E.g., Timesheet. |
File Checked Y/N | No – Filled by Analyst | Y if the analyst has validated the record, else N. |
File Value changed Y/N | No – Filled by Analyst | Y if the analyst has changed either the category or the subcategory of the record, else N. |
Validate Date | No – Filled by Analyst | Date on which the record was validated. |
Previewed (Y/N) | No – Filled by Analyst | Y if the analyst has seen the original document, else N. |
Contains PII | Yes | t (True) if the record contains any form of Personally Identifiable Information (PII), else f (False). |
FileType | Yes | File type of the document. |
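The Department/MainShare value in the table's example could be derived from the path rather than typed by hand. The rule below is an assumption inferred from that single example (the share is the folder directly under /mnt/, with a trailing $ removed); it is not a documented convention, and the client name still has to be supplied by the analyst.

```python
def department_main_share(path, client_name):
    """Build a 'ClientName-Share' value from a scan path.

    Assumes paths look like /mnt/<share>/..., as in the table's example.
    Returns an empty string for any other shape so the analyst can
    fill the value in manually.
    """
    parts = [part for part in path.split("/") if part]
    if len(parts) < 2 or parts[0] != "mnt":
        return ""
    share = parts[1].rstrip("$")  # "Extract$" -> "Extract"
    return f"{client_name}-{share}"
```

For the example path, department_main_share("/mnt/Extract$/Central Operations/", "ClientName") gives ClientName-Extract.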
Step 7:
After the basic clean-up is done, the file is ready for classification. The analyst goes through the records one by one and populates the columns. The task is to look at the Category and Subcategory of each record and validate that the machine has classified the file correctly; if it has not, the analyst must change them.
Scenario 1: Preview is available. On the screen below, clicking the path link displays the document.
Look at the file path below. Based on the file name alone, we cannot be sure whether Category Name: Business_Documents and Subcategory: Audit are right or wrong.
Use the preview option to view the file. If, after viewing it, you can confirm that the document is a business document related to Audit, no change is required to the Category and Subcategory columns.
Update File Checked: Y and File Value Changed: N.
If, after looking at the document, you think the machine has NOT assigned the right category, you can change it. In this case, I decided after viewing the document that it is an invoice, so I will change Category Name to Finance_Documents and Subcategory to Invoice/Purchase Order.
Also update File Checked: Y and File Value Changed: Y.
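The two outcomes in Scenario 1 (confirming the machine's labels or overriding them) follow the same bookkeeping, which can be sketched as one helper. The name record_review and the ISO date format for Validate Date are assumptions; the procedure does not specify a date format.

```python
from datetime import date


def record_review(row, category, subcategory, previewed, today=None):
    """Return a copy of the row with the analyst's decision applied.

    'File Value changed Y/N' is Y only when the category or the
    subcategory differs from what the machine assigned.
    """
    updated = dict(row)
    changed = (
        category != row.get("Category Name")
        or subcategory != row.get("SubCategory")
    )
    updated["Category Name"] = category
    updated["SubCategory"] = subcategory
    updated["File Checked Y/N"] = "Y"
    updated["File Value changed Y/N"] = "Y" if changed else "N"
    updated["Previewed (Y/N)"] = "Y" if previewed else "N"
    # ISO format (YYYY-MM-DD) is an assumption for illustration.
    updated["Validate Date"] = (today or date.today()).isoformat()
    return updated
```

Passing the machine's own labels back in records a confirmation (changed N); passing Finance_Documents and Invoice/Purchase Order for a record labelled Business_Documents and Audit records a correction (changed Y).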
Scenario 2: If Preview is NOT available:
When the above screen is not available (for any reason), the analyst must use their best judgement based on what they can discern from the file name and location.
If we take the same example as above, we can see that the file is stored in a folder of invoices, so I would change the category and subcategory to reflect an invoice.
It is important to remain consistent when judging categorization. If the analyst marks one file as Note/Log, they should always categorize similar files as Note/Log. This is more important than being perfectly accurate.
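A rough self-check for consistency can be sketched as follows. It treats checked files in the same folder with the same file type as "similar", which is only a heuristic assumed for illustration; a flagged folder is a prompt to take a second look, not proof of an error.

```python
import posixpath
from collections import defaultdict


def inconsistent_folders(rows):
    """Map (folder, file type) to its labels when more than one label was used.

    Only rows already marked as checked are considered.
    """
    labels = defaultdict(set)
    for row in rows:
        if row.get("File Checked Y/N") != "Y":
            continue
        folder = posixpath.dirname(row["Path"])
        file_type = (row.get("FileType") or "").lower()
        labels[(folder, file_type)].add((row["Category Name"], row["SubCategory"]))
    return {key: sorted(found) for key, found in labels.items() if len(found) > 1}
```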
Step 8:
After the above steps are completed, send the updated CSV file to the Getvisibility team, who will use the inputs to train the model for better accuracy and overall results.
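If the reviewed rows are held in a script rather than in Excel, they can be written back out as a UTF-8 CSV before sending. This is a minimal sketch; the function name rows_to_csv_bytes is hypothetical, and the expected column set for the returned file should be confirmed with the Getvisibility team.

```python
import csv
import io


def rows_to_csv_bytes(rows, columns):
    """Serialise row dicts to UTF-8 encoded CSV with the given column order."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" skips keys that are not in `columns`;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")
```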
Method 2: Online Method
Use the portal to train the model.
Navigation: Log in to the application > Landing > Verify
After you click Verify, the screen below for Manual Classification of Files is displayed:
Click on the path link to view the documents.
Based on the document content, change the Category, Subcategory, or PII value.
Click the Update button.
The new values are stored and can be used for training the ML model.
Classified as Getvisibility - Partner/Customer Confidential