FOSSology  3.2.0rc1
Open Source License Compliance by Open Source Software
licenses.c File Reference

utilities to scan, score and save license found data

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <signal.h>
#include <libgen.h>
#include <limits.h>
#include <stdlib.h>
#include "nomos.h"
#include "licenses.h"
#include "nomos_utils.h"
#include "util.h"
#include "list.h"
#include "nomos_regex.h"
#include "parse.h"
#include "_autodefs.h"

Macros

#define _GNU_SOURCE
 
#define HASHES   "#####################"
 
#define DEBCPYRIGHT   "debian/copyright"
 
#define MAX(a, b)   ((a) > (b) ? a : b)
 Max of two.
 
#define MIN(a, b)   ((a) < (b) ? a : b)
 Min of two.
 
#define LINE_BYTES   50
 
#define LINE_WORDS   8
 
#define WC_BYTES   30
 
#define WC_WORDS   3
 
#define PUNT_LINES   3
 
#define MIN_LINES   1
 

Functions

static void makeLicenseSummary (list_t *l, int highScore, char *target, int size)
 Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here.
 
static void noLicenseFound ()
 Mark current scan as LS_NOSUM (No_license_found)
 
static int searchStrategy (int, char *, int)
 
static void saveLicenseData (scanres_t *scores, int nCand, int nElem, int lowWater)
 Saves/creates all the license data in a specific directory (a temp directory?)
 
static int scoreCompare (const void *arg1, const void *arg2)
 Compare two scores.
 
static void printHighlightInfo (GArray *keyWords, GArray *theMatches)
 Print highlight info about matches.
 
void licenseInit ()
 license initialization
 
char * createRelativePath (item_t *p, scanres_t *scp)
 
void scanForKeywordsAndSetScore (scanres_t *scores, list_t *licenseList)
 
void relaxScoreCriterionForSingleFile (scanres_t *scores)
 Reset the score to 1 if it is 0.
 
int fiterResultsOfKeywordScan (int lowWater, scanres_t *scores, int nFiles)
 Run through the list once more.
 
void licenseScan (list_t *licenseList)
 scan the list for a license(s)
 
static void printKeyWordMatches (scanres_t *scores, int idx)
 Prints keywords match to STDOUT.
 
static gint compare_integer (gconstpointer a, gconstpointer b)
 Compare two integers.
 
static void rescanOriginalTextForFoundLicences (char *textp, int isFileMarkupLanguage, int isPS)
 Rescan original content for the licenses already found.
 

Variables

static char any [6]
 
static char some [7]
 
static char few [6]
 
static char year [7]
 

Detailed Description

utilities to scan, score and save license found data

Version
"$Id: licenses.c 4032 2011-04-05 22:16:20Z bobgo $"

Definition in file licenses.c.

Macro Definition Documentation

#define LINE_BYTES   50

fudge for punctuation, etc.

Definition at line 257 of file licenses.c.

#define LINE_WORDS   8

assume this many words per line

Definition at line 258 of file licenses.c.

#define MIN_LINES   1

normal minimum-extra-lines

Definition at line 262 of file licenses.c.

#define PUNT_LINES   3

if "dunno", guess this line-count

Definition at line 261 of file licenses.c.

#define WC_BYTES   30

wild-card counts this many bytes

Definition at line 259 of file licenses.c.

#define WC_WORDS   3

wild-card counts this many words

Definition at line 260 of file licenses.c.

Function Documentation

static gint compare_integer (gconstpointer a, gconstpointer b)

Compare two integers.

Returns
negative value if a < b; zero if a = b; positive value if a > b.
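
A minimal sketch of a comparator that honours this contract. It uses plain C types so it compiles without GLib, and the names `compare_int_sketch` and `sort_ints` are hypothetical; the actual body of compare_integer() is not reproduced here.

```c
#include <stdlib.h>

/* Hypothetical stand-in for compare_integer(): gconstpointer is replaced
 * by const void * so the sketch needs no GLib headers. */
static int compare_int_sketch(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    /* (x > y) - (x < y) avoids the overflow that x - y can cause. */
    return (x > y) - (x < y);
}

/* Sort an int array in ascending order using the comparator above. */
static void sort_ints(int *values, size_t count)
{
    qsort(values, count, sizeof values[0], compare_int_sketch);
}
```

The same three-way result (negative, zero, positive) is what sorting routines that take a comparison callback expect.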

Definition at line 928 of file licenses.c.

int fiterResultsOfKeywordScan (int lowWater, scanres_t *scores, int nFiles)

Run through the list once more.

This time we record and count the license candidates to process. License candidates are determined by either (score >= low) OR matching a set of filename patterns.
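
The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the code in licenses.c: `struct cand_sketch` is a simplified stand-in for scanres_t, the "copyright" name test is only an example pattern, and returning the candidate count is assumed because the return value is not documented here.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for scanres_t. */
struct cand_sketch {
    int score;          /* keyword score from the earlier pass */
    const char *name;   /* file name */
    int isCandidate;    /* set to 1 when the file is selected */
};

/* Assumed example pattern: names containing "copyright" (as in
 * debian/copyright) always qualify. The real pattern set is not shown. */
static int name_matches_sketch(const char *name)
{
    return strstr(name, "copyright") != NULL;
}

/* Mark every entry whose score reaches lowWater OR whose name matches,
 * and return how many entries were marked. */
static int filter_candidates_sketch(int lowWater, struct cand_sketch *scores,
                                    int nFiles)
{
    int nCand = 0;

    for (int i = 0; i < nFiles; i++) {
        scores[i].isCandidate =
            scores[i].score >= lowWater || name_matches_sketch(scores[i].name);
        if (scores[i].isCandidate)
            nCand++;
    }
    return nCand;
}
```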

Parameters
lowWater    Lowest score to filter
scores      Scores to filter
nFiles      Number of files
Returns

Definition at line 713 of file licenses.c.

void licenseInit ( )

license initialization

Examine the search strings in licSpec looking for 3 corner-cases to optimize all the regex-searches we'll be making:

  1. The seed string is the same as the text-search string
  2. The text-search string has length 1 and contents == "."
  3. The seed string is the 'null-string' indicator

Step 1, copy the tseed "search seed", decrypt it, and munge any wild-cards in the string. Note that once we eliminate the compile-time string encryption, we could re-use the same exact data. In fact, some day (in our copious spare time), we could effectively remove licSpec.

Step 2, add the search-seed to the search-cache

Step 3, handle special cases of NULL seeds and (regex == seed)

Step 4, decrypt and fix the regex (since seed != regex here). Once we have all that, searchStrategy() helps determine how many lines above and below [the seed] to save – see findPhrase() for details.

Now that we've computed the above- and below-values for license searches, set each of the appropriate entries with the MAX values determined. Limit 'above' values to 3 and 'below' values to 6.
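
A minimal sketch of that merge-and-cap step. The struct and function names are hypothetical, and only the limits 3 and 6 come from the text above.

```c
/* Limits quoted in the text above. */
enum { ABOVE_LIMIT = 3, BELOW_LIMIT = 6 };

/* Hypothetical pair of context-line counts for one license entry. */
struct context_sketch {
    int above;
    int below;
};

/* Raise the stored counts to the newly computed ones (MAX), then cap them. */
static void merge_context_sketch(struct context_sketch *c, int above, int below)
{
    if (above > c->above) c->above = above;
    if (below > c->below) c->below = below;
    if (c->above > ABOVE_LIMIT) c->above = ABOVE_LIMIT;
    if (c->below > BELOW_LIMIT) c->below = BELOW_LIMIT;
}
```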

QUESTION: the above has worked in the past - is it STILL valid?

Finally (if enabled), compare each of the search strings to see if there are duplicates, and determine if some of the regexes can be searched via strstr() (instead of its slower-but-more-functional regex brethren).
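
One way to decide whether a pattern qualifies for strstr() is to check that it contains no regex metacharacters. The sketch below assumes case-sensitive matching and uses hypothetical helper names; the test actually applied in licenseInit() may differ.

```c
#include <string.h>

/* Return 1 if the pattern has no regex metacharacters, so a plain
 * substring search finds the same matches as the regex would. */
static int is_plain_string_sketch(const char *pattern)
{
    return pattern[strcspn(pattern, ".^$*+?()[]{}|\\")] == '\0';
}

/* Substring search used for patterns that pass the check above. */
static int text_contains_sketch(const char *text, const char *pattern)
{
    return strstr(text, pattern) != NULL;
}
```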

Definition at line 82 of file licenses.c.

void licenseScan (list_t *licenseList)

scan the list for a license(s)

This routine takes a list, but in fossology we always pass in a single file.

Set up defaults for the minimum scores for which we'll save files. Try to ensure a minimum number of license files will be recorded for this source/package (try, but don't force it too hard); see if lower scores yield a better fit, but recognize that the chance of saving a non-license file increases as we lower the bar.

Definition at line 759 of file licenses.c.

static void makeLicenseSummary (list_t *l, int highScore, char *target, int size)

Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here.

parseLicenses() added license components found, as long as they were considered "interesting" to some extent. Components of significant interest had their iFlag set to 1; those of lower-interest were set to 0. In this way we can tier license components into 4 distinct levels: 'interesting', 'medium interest', 'nothing significant', and 'Zero'.
==> If the list is EMPTY, there's nothing, period.
==> If listCount() returns non-zero, "interesting" stuff is in it and we can safely ignore things of 'significantly less interest'.
==> If neither of these is the case, only the licenses of the above 'significantly less interest' category exist (don't ignore them).

We need to be VERY careful in this routine about the length of the license-summary created; they COULD be indefinitely long! For now, just check to see if we're going to overrun the buffer...
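
A minimal sketch of an append that checks for overrun before writing. The helper name, the comma separator and the use of size_t (the documented parameter is an int) are assumptions made for illustration.

```c
#include <string.h>

/* Append name to the comma-separated summary in target (capacity size,
 * including the terminating NUL). Returns 1 on success, or 0 and leaves
 * target unchanged if the result would not fit. */
static int append_license_sketch(char *target, size_t size, const char *name)
{
    size_t used = strlen(target);
    size_t sep = used > 0 ? 1 : 0;          /* room for the ',' separator */
    size_t need = used + sep + strlen(name) + 1;

    if (need > size)
        return 0;                           /* would overrun the buffer */
    if (sep)
        target[used++] = ',';
    strcpy(target + used, name);
    return 1;
}
```

Checking the required length first means a summary that does not fit is rejected whole instead of being cut off in the middle of a license name.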

Note
This function adds licenses to cur.compLic

Definition at line 1276 of file licenses.c.

static void printHighlightInfo (GArray *keyWords, GArray *theMatches)

Print highlight info about matches.

This function prints to STDOUT only if OPTS_HIGHLIGHT_STDOUT is set.

Format:
  Keyword at <start>, length <length>, index = 0,
  License #<name># at <start>, length <length>, index = <license_index>,

Parameters
keyWords      Keywords matches
theMatches    License matches

Definition at line 866 of file licenses.c.

void relaxScoreCriterionForSingleFile (scanres_t *scores)

Reset the score to 1 if it is 0.

If we were invoked with a single-file-only option, just over-ride the score calculation – give the file any greater-than-zero score so it appears as a valid candidate. This is important when the file to be evaluated has no keywords, yet might contain authorship inferences.
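
The override amounts to a single conditional assignment. In this sketch `struct score_sketch` is a hypothetical, simplified stand-in for scanres_t.

```c
/* Hypothetical, simplified stand-in for scanres_t. */
struct score_sketch {
    int score;
};

/* Give a zero-scored file the smallest positive score so that it still
 * counts as a candidate; non-zero scores are left unchanged. */
static void relax_score_sketch(struct score_sketch *s)
{
    if (s->score == 0)
        s->score = 1;
}
```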

Parameters
scores
Note
It is always the case that we are doing one file at a time.

Definition at line 691 of file licenses.c.

static void rescanOriginalTextForFoundLicences (char *textp, int isFileMarkupLanguage, int isPS)

Rescan original content for the licenses already found.

Parameters
textp                   Original text string
isFileMarkupLanguage    Is original text a markup text
isPS                    Is original text a PostScript text

Definition at line 948 of file licenses.c.

static void saveLicenseData (scanres_t *scores, int nCand, int nElem, int lowWater)

Saves/creates all the license data in a specific directory (a temp directory?)

Note
OF SPECIAL INTEREST: this function changes directory!
Todo:

CDB - Some initializations happen here for no particular reason

We should filter some names out like the shellscript does. For instance, word-spell-dictionary files will score high but will likely NOT contain a license. But the shellscript filters these names AFTER they're already scanned. Think about it.

BUG: When _FTYP_POSTSCR is "(postscript|utf-8 unicode)", the resulting license-parse yields 'NoLicenseFound' but when both "postscript" and "utf-8 unicode" are searched independently, parsing definitely finds quantifiable licenses. WHY?

Definition at line 994 of file licenses.c.

void scanForKeywordsAndSetScore (scanres_t *scores, list_t *licenseList)

For EACH file, determine if we want to scan it, and if so, scan the candidate files for keywords (to obtain a "score" – the higher the score, the more likely it has a real open source license in it).

There are lots of things that will 'disinterest' us in a file (below).

Parameters
scores
licenseList
Note
This loop is called 400,000 to 500,000 times when parsing a distribution. Little slow-downs ADD UP quickly!
Some other part of FOSSology has already decided we want to scan this file, so we need to look into removing this file scoring stuff.
Todo:
We don't currently use _UTIL_FILTER, which is set up to exclude some files by filename.

Definition at line 606 of file licenses.c.

static int scoreCompare (const void *arg1, const void *arg2)

Compare two scores.

Returns
-1 if score1 > score2
1 if score1 < score2
-1 if fullpath1 != NULL and fullpath2 = NULL
1 if fullpath1 = NULL and fullpath2 != NULL
Otherwise, the result of a string comparison of the two fullpaths
Note
this procedure is a qsort callback that provides a REVERSE integer sort (highest to lowest)
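
A minimal sketch of a qsort callback that follows the rules listed above. `struct entry_sketch` is a hypothetical, simplified stand-in for scanres_t, and treating two NULL paths as equal is an assumption, since that case is not documented.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for scanres_t. */
struct entry_sketch {
    int score;
    const char *fullpath;   /* may be NULL */
};

static int score_compare_sketch(const void *arg1, const void *arg2)
{
    const struct entry_sketch *s1 = arg1;
    const struct entry_sketch *s2 = arg2;

    if (s1->score > s2->score) return -1;   /* higher score sorts first */
    if (s1->score < s2->score) return 1;
    if (s1->fullpath != NULL && s2->fullpath == NULL) return -1;
    if (s1->fullpath == NULL && s2->fullpath != NULL) return 1;
    if (s1->fullpath == NULL) return 0;     /* both NULL: assumed equal */
    return strcmp(s1->fullpath, s2->fullpath);
}

/* Sort entries from highest to lowest score, ties broken by path. */
static void sort_entries_sketch(struct entry_sketch *e, size_t n)
{
    qsort(e, n, sizeof e[0], score_compare_sketch);
}
```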

Definition at line 821 of file licenses.c.

static int searchStrategy (int index, char *regex, int aboveCalc)
Note
This function should be called BEFORE the wild-card specifier =ANY= is converted to a REAL regex ".*" (e.g., before fixSearchString())!

ASSUME a "standard line-length" of 50 characters/bytes. That's likely too small, but err on the side of being too conservative.

Determining the number of text-lines ABOVE involves finding out how far into the 'license footprint' the seed-word resides, using the assumed line-length of 50. If the seed isn't IN the regex, assume a generally-bad worst-case and search 2-3 lines above.

Determining the number of text-lines BELOW involves finding out how long the 'license footprint' actually is, plus adding some fudge based on the number of wild-cards in the footprint.
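
The two estimates can be sketched roughly as follows. The macro values are the ones documented on this page, but the arithmetic, the helper names and the restriction to the =ANY= specifier are assumptions made for illustration; the real searchStrategy() may weigh these quantities differently.

```c
#include <string.h>

#define LINE_BYTES 50   /* assumed bytes per line */
#define WC_BYTES   30   /* bytes charged per wild-card */
#define PUNT_LINES 3    /* guess when the seed is not in the regex */
#define MIN_LINES  1    /* normal minimum of extra lines */

/* Count non-overlapping occurrences of the =ANY= wild-card specifier. */
static int count_wildcards_sketch(const char *regex)
{
    int n = 0;

    for (const char *p = strstr(regex, "=ANY="); p != NULL;
         p = strstr(p + 5, "=ANY="))
        n++;
    return n;
}

/* Lines to keep above the seed: how far into the footprint the seed sits. */
static int lines_above_sketch(const char *regex, const char *seed)
{
    const char *hit = strstr(regex, seed);

    if (hit == NULL)
        return PUNT_LINES;
    return (int)((hit - regex) / LINE_BYTES) + MIN_LINES;
}

/* Lines to keep below: footprint length plus a per-wild-card fudge. */
static int lines_below_sketch(const char *regex)
{
    size_t bytes = strlen(regex)
                 + (size_t)count_wildcards_sketch(regex) * WC_BYTES;

    return (int)(bytes / LINE_BYTES) + MIN_LINES;
}
```

Counting =ANY= on the unconverted string is why the note above asks for the call to happen before the specifier is rewritten to ".*".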

Parameters
index        License index from Strings.in
regex        regex to match for
aboveCalc    Set to look above

Definition at line 285 of file licenses.c.