python - Find duplicate records in large text file -


I'm on a Linux machine (Red Hat) and have an 11 GB text file. Each line in the text file contains the data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80 GB text file, so any solution that requires loading the entire file into memory is not practical.

Read the file line by line, so you don't have to load it all into memory.
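
A minimal sketch of that (the file name is just a stand-in; the per-record work is filled in further below):

with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        # only the current line (plus a small read buffer) is held in memory
        pass  # per-record work goes here (hashing and duplicate check, below)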

For each line (record) create a SHA-256 hash (32 bytes), unless the identifier is shorter.
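
For example (the record content and the identifier length n here are made up; the point is that a SHA-256 digest is always 32 bytes, so a shorter identifier can be stored as-is):

import hashlib

n = 16                                          # hypothetical identifier length
line = b"0123456789ABCDEF rest of the record\n"
digest = hashlib.sha256(line).digest()
print(len(digest))      # 32 -> fixed-size key, independent of record length
print(len(line[:n]))    # 16 -> shorter than 32, so the raw identifier is the more compact key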

Store the hashes/identifiers in a numpy.array. That is probably the most compact way to store them. 27 million records times 32 bytes/hash is 864 MB. That should fit into the memory of a decent machine these days.
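
If you go the numpy route, a rough sketch could look like this (collecting digests into a byte buffer first and using np.unique to count distinct rows are my own choices, not spelled out above):

import hashlib
import numpy as np

digests = bytearray()
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        digests += hashlib.sha256(line).digest()   # 32 bytes per record

arr = np.frombuffer(bytes(digests), dtype=np.uint8).reshape(-1, 32)
n_unique = len(np.unique(arr, axis=0))             # number of distinct 32-byte rows
print(arr.shape[0] - n_unique, "duplicate records")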

To speed up access, use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of each hash in a list in the value. In effect this creates a hash table with 65536 buckets. With 27e6 records, each bucket would contain on average a list of around 400 entries. That means faster searching than a plain numpy array, but it uses more memory.

import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:   # binary mode: hashlib needs bytes
    for line in datafile:
        id = hashlib.sha256(line).digest()
        # or id = line[:n]
        k = id[0:2]
        v = id[2:]
        if v in d[k]:
            print("double found:", id)
        else:
            d[k].append(v)
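
A variant of the above, not part of the original answer: keeping a set per bucket instead of a list makes the membership test O(1) instead of a linear scan through roughly 400 entries, at the cost of somewhat more memory:

import collections
import hashlib

d = collections.defaultdict(set)
with open('bigdata.txt', 'rb') as datafile:
    for line in datafile:
        digest = hashlib.sha256(line).digest()
        k, v = digest[:2], digest[2:]
        if v in d[k]:
            print("double found:", digest.hex())
        else:
            d[k].add(v)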
